Docs 1809: lazy dataframe documentation and example notebook (#1815)
#### Reference Issues/PRs
Closes #1809
#### What does this implement or fix?
Adds documentation and an example notebook for the new `LazyDataFrame`
API for processing operations. Also updated existing example notebooks
to use the `LazyDataFrame` instead of `QueryBuilder`, and the
`QueryBuilder` demo notebook to refer to the `LazyDataFrame` API being
preferred.
Also tweaked the docstrings for `LazyDataFrame` and
`LazyDataFrameCollection` to improve rendering and remove duplication of
`QueryBuilder` docstrings.
| 4000 | The number, type, or name of the columns has been changed. | Ensure that the type and order of the columns has not changed when appending or updating the previous version. This restriction only applies when `Dynamic Schema` is disabled - if you require the columns sets to change, please enable the `Dynamic Schema` option on your library. |
-| 4001 | The specified column does not exist. | Please specify a valid column - use the `get_description` method to see all of the columns associated with a given symbol. |
-| 4002 | The requested operation is not supported with the type of column provided. | Certain operations are not supported over all column types e.g. arithmetic with the `QueryBuilder` over string columns - use the `get_description` method to see all of the columns associated with a given symbol, along with their types. |
-| 4003 | The requested operation is not supported with the index type of the symbol provided. | Certain operations are not supported over all index types e.g. column statistics generation with a string index - use the `get_description` method to see the index(es) associated with a given symbol, along with their types. |
-| 4004 | The requested operation is not supported with pickled data. | Certain operations are not supported with pickled data e.g. `date_range` filtering. If such operations are required, you must ensure that the data is of a normalizable type, such that it can be written using the `write` method, and does not require the `write_pickle` method. |
+| 4001 | The specified column does not exist. | Please specify a valid column - use the `get_description` method to see all of the columns associated with a given symbol. |
+| 4002 | The requested operation is not supported with the type of column provided. | Certain operations are not supported over all column types e.g. arithmetic in the processing pipeline over string columns - use the `get_description` method to see all of the columns associated with a given symbol, along with their types.|
+| 4003 | The requested operation is not supported with the index type of the symbol provided. | Certain operations are not supported over all index types e.g. column statistics generation with a string index - use the `get_description` method to see the index(es) associated with a given symbol, along with their types. |
+| 4004 | The requested operation is not supported with pickled data. | Certain operations are not supported with pickled data e.g. `date_range` filtering. If such operations are required, you must ensure that the data is of a normalizable type, such that it can be written using the `write` method, and does not require the `write_pickle` method. |
### Storage Errors
@@ -92,7 +92,7 @@ For legacy reasons, the terms `symbol`, `stream`, and `stream ID` are used inter
These errors relate to data being pickled, which limits the operations available. Internally, pickled symbols are stored as opaque, serialised binary blobs in the [data layer](technical/on_disk_storage.md#data-layer). No index or column information is maintained in this serialised object which is in contrast to non-pickled data, where this information is stored in the [index layer](technical/on_disk_storage.md#index-layer).
-Furthermore, it is not possible to partially read/update/append the data using the ArcticDB API or use the QueryBuilder with pickled symbols.
+Furthermore, it is not possible to partially read/update/append the data using the ArcticDB API or use the processing pipeline with pickled symbols.
All of these errors are of type `arcticdb.exceptions.ArcticException`.
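To illustrate why pickled data cannot be partially read, here is a minimal sketch using only Python's standard `pickle` module. It is not ArcticDB code: the `table` dictionary and its column names are hypothetical stand-ins for the data held by a symbol.

```python
import pickle

# Hypothetical stand-in for a symbol's data: two columns of three rows.
table = {"price": [10.0, 11.5, 12.25], "volume": [100, 150, 90]}

# Pickling produces a single opaque byte string.
blob = pickle.dumps(table)

# The blob exposes no separately addressable index or column metadata,
# so reaching one column means deserialising the whole object first.
restored = pickle.loads(blob)
price_column = restored["price"]

print(type(blob).__name__, len(blob), price_column)
```

A reader that only sees `blob` has nothing to filter or slice on, which is the situation the pickling errors describe.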
@@ -135,20 +135,20 @@ All of these errors are of type `arcticdb.exceptions.ArcticException`.
| Non-contiguous rows, range search on unsorted data?... | `read` method called with the optional `date_range` argument specified, and the symbol has a timestamp index, but it is not sorted. | To use the `date_range` argument to `read`, the user must ensure the data is sorted on the index at write time. |
| Delete in range will not work as expected with a non-timeseries index | `delete_data_in_range` method called, but the symbol does not have a timestamp index. | None, the `delete_data_in_range` method does not make sense without a timestamp index. |
-### QueryBuilder errors
+### Processing pipeline errors
-Due to the client-only nature of ArcticDB, it is not possible to know if a `QueryBuilder` provided to `read` makes sense for the given symbol without interacting with the storage. In particular, we do not know:
+Due to the client-only nature of ArcticDB, it is not possible to know if a processing operation applied to a `LazyDataFrame`, or provided to `read` with a `QueryBuilder` object, makes sense for the given symbol without interacting with the storage. In particular, we do not know:
* Whether a specified column exists
* What the type of the data held in a specified column is if it does exist
All of these errors are of type `arcticdb.exceptions.ArcticException`.
-| Error messages | Cause | Resolution |
-|:--------------|:-------|:-----------|
-| Unexpected column name | A column name was specified with the `QueryBuilder` that does not exist for this symbol, and the library has dynamic schema disabled. | None of the supported `QueryBuilder` operations (filtering, projections, group-bys and aggregations) make sense with non-existent columns.|
-| Non-numeric type provided to binary operation: <typename\> | Error messages like this imply that an operation that ArcticDB does not support was provided in the `QueryBuilder` argument e.g. adding two string columns together. | The `get_description` method can be used to inspect the types of the columns. A full list of supported operations are provided in the `QueryBuilder` [API documentation](api/query_builder.md). |
-| Cannot compare <typename 1\> to <typename 2\> (possible categorical?) | If `get_description` indicates that a column is of categorical type, and this categorical is being used to store string values, then comparisons to other strings will fail with an error message like this one. | Categorical support in ArcticDB is [extremely limited](faq.md#does-arcticdb-support-categorical-data), but may be added in the future. |
+| Unexpected column name | A column name was specified for a processing operation that does not exist for this symbol, and the library has dynamic schema disabled. | Use `get_description` to ensure that column names provided in processing operations exist for the symbol. |
+| Non-numeric type provided to binary operation: <typename\> | Error messages like this imply that an operation that ArcticDB does not support was provided in a processing operation e.g. adding two string columns together. | The `get_description` method can be used to inspect the types of the columns. A full list of supported operations are provided in the `QueryBuilder` [API documentation](api/processing.md). |
+| Cannot compare <typename 1\> to <typename 2\> (possible categorical?) | If `get_description` indicates that a column is of categorical type, and this categorical is being used to store string values, then comparisons to other strings will fail with an error message like this one. | Categorical support in ArcticDB is [extremely limited](faq.md#does-arcticdb-support-categorical-data), but may be added in the future. |
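The deferred nature of these checks can be sketched with a toy model in plain Python. This is an illustration only, not ArcticDB's implementation: `ToyLazyFrame`, `STORAGE`, and the exception types are hypothetical, and the messages are chosen to loosely mirror the table above.

```python
# Hypothetical in-memory "storage": symbol name -> columns.
STORAGE = {"prices": {"bid": [1.0, 2.0, 3.0], "venue": ["A", "B", "A"]}}


class ToyLazyFrame:
    """Records operations without touching storage until collect()."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.operations = []  # list of (new_column, left, right)

    def add_columns(self, new_column, left, right):
        # Nothing is validated here: no storage access has happened yet.
        self.operations.append((new_column, left, right))
        return self

    def collect(self):
        # Only now is the data read, so only now can columns be checked.
        columns = dict(STORAGE[self.symbol])
        for new_column, left, right in self.operations:
            for name in (left, right):
                if name not in columns:
                    raise KeyError(f"Unexpected column name: {name}")
                if not all(isinstance(v, (int, float)) for v in columns[name]):
                    raise TypeError(f"Non-numeric type in column: {name}")
            columns[new_column] = [a + b for a, b in zip(columns[left], columns[right])]
        return columns


# Building the pipelines succeeds even though two of them are invalid.
ok = ToyLazyFrame("prices").add_columns("double_bid", "bid", "bid")
missing = ToyLazyFrame("prices").add_columns("x", "bid", "ask")
non_numeric = ToyLazyFrame("prices").add_columns("x", "bid", "venue")

print(ok.collect()["double_bid"])
for frame in (missing, non_numeric):
    try:
        frame.collect()
    except (KeyError, TypeError) as error:
        print(type(error).__name__, error)
```

In this sketch the invalid pipelines fail only inside `collect()`, because that is the first point at which the column names and types are known.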
docs/mkdocs/docs/faq.md (+4 -2)
@@ -105,7 +105,7 @@ Note that this is a library configuration option that is off by default, see [`h
ArcticDB is primarily focused on filtering and transferring data from storage through to memory - at which point Pandas, NumPy, or other standard analytical packages can be utilised for analytics.
-That said, ArcticDB does offer a limited set of analytical functions that are executed inside the C++ storage engine offering significant performance benefits over Pandas. For more information, see the documentation for the *QueryBuilder* class.
+That said, ArcticDB does offer a limited set of analytical functions that are executed inside the C++ storage engine offering significant performance benefits over Pandas. For more information, see the [documentation](api/processing.md) for the `LazyDataFrame`, `LazyDataFrameCollection`, and `QueryBuilder` classes.
### *What does Pickling mean?*
@@ -172,7 +172,9 @@ Please see the [Runtime Configuration](runtime_config.md#versionstorenumcputhrea
### Does ArcticDB support categorical data?
-ArcticDB currently offers extremely limited support for categorical data. Series and DataFrames with categorical columns can be provided to the `write` and `write_batch` methods, and will then behave as expected on `read`. However, `append` and `update` are not yet supported with categorical data, and will raise an exception if attempted. The `QueryBuilder` is also not supported with categorical data, and will either raise an exception, or give incorrect results, depending on the exact operations requested.
+ArcticDB currently offers extremely limited support for categorical data. Series and DataFrames with categorical columns can be provided to the `write` and `write_batch` methods, and will then behave as expected on `read`.
+However, `append` and `update` are not yet supported with categorical data, and will raise an exception if attempted.
+Analytics such as filtering using the `LazyDataFrame` or `QueryBuilder` classes is also not supported with categorical data, and will either raise an exception, or give incorrect results, depending on the exact operations requested.
docs/mkdocs/docs/index.md (+11 -4)
@@ -213,18 +213,25 @@ _output (the rows in the date range and columns requested)_
2000-01-01 13:00:00 18 8
```
-#### Filtering
+#### Filtering and Analytics
-ArcticDB uses a Pandas-_like_ syntax to describe how to filter data. For more details including the limitations, please view the docstring ([`help(QueryBuilder)`](api/query_builder)).
+ArcticDB supports many common DataFrame analytics operations, including filtering, projections, group-bys, aggregations, and resampling. The most intuitive way to access these operations is via the [`LazyDataFrame`](api/processing.md#arcticdb.LazyDataFrame) API, which should feel familiar to experienced users of Pandas or Polars.
-!!! info "ArcticDB Filtering Philosophy & Restrictions"
+The legacy [`QueryBuilder`](api/processing.md#arcticdb.QueryBuilder) class can also be created directly and passed into `read` calls with the same effect.
+
+!!! info "ArcticDB Analytics Philosophy"
In most cases this is more memory efficient and performant than the equivalent Pandas operation as the processing is within the C++ storage engine and parallelized over multiple threads of execution.