# Releases · pathwaycom/pathway
## v0.27.0

### Added

- JetStream extension is now supported in both NATS read and write connectors.
- The Iceberg connectors now support Glue as a catalog backend.
- New `Table.add_update_timestamp_utc` function for tracking the update time of rows in a table.
### Changed

- **BREAKING**: The API for the Iceberg connectors has changed. The `catalog` parameter is now required in both `pw.io.iceberg.read` and `pw.io.iceberg.write`. This parameter can be of type `pw.io.iceberg.RestCatalog` or `pw.io.iceberg.GlueCatalog`, and it must contain the connection parameters.
- **BREAKING**: `paddlepaddle` is no longer a dependency of the Pathway package, since choosing a version specific to the target hardware is advantageous for performance. To install `paddlepaddle`, follow the instructions at https://www.paddlepaddle.org.cn/en/install/quick.
- `pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer` now supports document reranking. This enables two-stage retrieval, where an initial vector similarity search is followed by reranking to improve document relevance ordering.
### Fixed

- Endpoints created by `pw.io.http.rest_connector` now accept requests both with and without a trailing slash. For example, `/endpoint/` and `/endpoint` are now treated equivalently.
- Schemas that inherit from other schemas now automatically preserve all properties of their parent schemas.
- Fixed an issue where the persistence configuration failed when provided with a relative filesystem path.
- Fixed unique name autogeneration for the Python connectors.
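The trailing-slash fix above can be illustrated with a small path-normalization sketch. The helper name `normalize_endpoint` is hypothetical and not part of the Pathway API; it only demonstrates the equivalence described in the release note.

```python
def normalize_endpoint(path: str) -> str:
    """Treat '/endpoint/' and '/endpoint' as the same route.

    Illustrative only: shows the equivalence described in the
    release notes, not Pathway's internal routing code.
    """
    # Keep the root path "/" intact; strip one trailing slash otherwise.
    if path != "/" and path.endswith("/"):
        return path[:-1]
    return path
```

With this rule, `normalize_endpoint("/endpoint/")` and `normalize_endpoint("/endpoint")` resolve to the same route key.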
## v0.26.4

### Added

- New external integration with Qdrant.
- `pw.io.mysql.write` method for writing to MySQL. It supports two output table types: a stream of changes and a realtime-updated data snapshot.
### Changed

- `pw.io.deltalake.read` now accepts the `start_from_timestamp_ms` parameter for non-append-only tables. In this case, the connector replays the history of changes in the table version by version, starting from the state of the table at the given timestamp. The differences between versions are applied atomically.
- Asynchronous UDFs for connecting to API-based LLM and embedding models now default to the retry strategy `pw.udfs.ExponentialRetryStrategy()`.
- `pw.io.postgres.write` now supports two output table types: a stream of changes and a realtime-updated data snapshot. The output table type can be chosen with the `output_table_type` parameter. The `pw.io.postgres.write_snapshot` method has been deprecated.
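The idea behind an exponential retry strategy for async UDFs can be sketched in plain Python. This is a hypothetical illustration of the general pattern (retry a failing async call with exponentially growing delays), not the implementation of `pw.udfs.ExponentialRetryStrategy`, and the helper name is invented:

```python
import asyncio


async def call_with_exponential_retry(fn, max_retries=3,
                                      initial_delay=0.01,
                                      backoff_factor=2.0):
    """Hypothetical sketch: retry an async call with exponential backoff."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            await asyncio.sleep(delay)
            delay *= backoff_factor  # grow the delay exponentially
```

A transient failure that succeeds on the third attempt would be absorbed by the strategy instead of failing the UDF.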
## v0.26.3

### Added

- New parser `pathway.xpacks.llm.parsers.PaddleOCRParser` supporting parsing of PDF, PPTX, and image files.
## v0.26.2

### Added

- `pw.io.gdrive.read` now supports the `"only_metadata"` format. When this format is used, the table contains only metadata updates for the tracked directory, without reading object contents.
- Detailed metrics can now be exported to SQLite. Enable this feature using the environment variable `PATHWAY_DETAILED_METRICS_DIR` or via `pw.set_monitoring_config()`.
- `pw.io.kinesis.read` and `pw.io.kinesis.write` methods for reading from and writing to AWS Kinesis.
### Fixed

- A bug leading to potentially unbounded memory consumption in the `Table.forget` and `Table.sort` operators during multi-worker runs has been fixed.
- Improved memory efficiency during cold starts by compacting intermediary structures and reducing retained memory after backfilling.
### Changed

- The frequency of background operator snapshot compression in data persistence is limited to the greater of the user-defined `snapshot_interval` and 30 minutes when S3 or Azure is used as the backend, in order to avoid frequent calls to potentially expensive operations.
- The Google Drive input connector performance has been improved, especially when handling directories with many nested subdirectories.
- The MCP server `tool` method now allows passing the optional data `title`, `output_schema`, `annotations`, and `meta` to inform the LLM client.
- Relaxed the `boto3` dependency to `<2.0.0`.
## v0.26.1

### Added

- `pw.Table.forget` to remove old (in terms of event time) entries from the pipeline.
- `pw.Table.buffer`, a stateful buffering operator that delays entries until the condition `time_column <= max(time_column) - threshold` is met.
- `pw.Table.ignore_late` to filter out old (in terms of event time) entries.
- Row batching for async UDFs. It can be enabled with the `max_batch_size` parameter.
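The release rule of the buffering operator above can be illustrated in plain Python. This is not Pathway operator code; the helper name `released_rows` is invented, and the snippet only evaluates the stated condition over a list of event times:

```python
def released_rows(times, threshold):
    """Hypothetical illustration of the pw.Table.buffer release rule.

    A buffered entry is released once
        time_column <= max(time_column) - threshold,
    i.e. once enough later data has arrived.
    """
    if not times:
        return []
    watermark = max(times) - threshold
    return [t for t in times if t <= watermark]
```

For event times `[1, 2, 3, 10]` and `threshold=5`, the watermark is `5`, so the entries at times 1, 2, and 3 are released while the entry at time 10 stays buffered.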
### Changed

- `pw.io.subscribe` and `pw.io.python.write` now work with async callbacks.
- The `diff` column in tables automatically created by `pw.io.postgres.write` and `pw.io.postgres.write_snapshot` in the `replace` and `create_if_not_exists` initialization modes now uses the `smallint` type.
- The `optimize_transaction_log` option has been removed from `pw.io.deltalake.TableOptimizer`.
### Fixed

- `pw.io.postgres.write` and `pw.io.postgres.write_snapshot` now respect the type optionality defined in the Pathway table schema when creating a new PostgreSQL table. This applies to the `replace` and `create_if_not_exists` initialization modes.
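The nullability rule described in the fix above can be sketched with Python's `typing` introspection. This is a hypothetical illustration of the mapping (Optional schema field → nullable column, everything else → `NOT NULL`), not Pathway's actual DDL-generation code, and the helper name is invented:

```python
from typing import Optional, Union, get_args, get_origin


def sql_nullability(py_type) -> str:
    """Hypothetical sketch: map a schema field type to SQL nullability."""
    # Optional[X] is Union[X, None] under the hood.
    if get_origin(py_type) is Union and type(None) in get_args(py_type):
        return "NULL"
    return "NOT NULL"
```

Under this rule an `Optional[int]` field produces a nullable column, while a plain `int` field produces a `NOT NULL` one.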
## v0.26.0

### Added

- `path_filter` parameter in the `pw.io.s3.read` and `pw.io.minio.read` functions. It enables post-filtering of object paths using a wildcard pattern (`*`, `?`), allowing exclusion of paths that pass the main `path` filter but do not match `path_filter`.
- Input connectors now support backpressure control via `max_backlog_size`, allowing you to limit the number of read events in processing per connector. This is useful when the data source emits a large initial burst followed by smaller, incremental updates.
- `pw.reducers.count_distinct` and `pw.reducers.count_distinct_approximate` to count the number of distinct elements in a table. `pw.reducers.count_distinct_approximate` lets you save memory by decreasing the accuracy; this tradeoff is controlled with the `precision` parameter.
- `pw.Table.join` (and its variants) now has two additional parameters: `left_exactly_once` and `right_exactly_once`. If the elements from one side of a join should be joined exactly once, the `*_exactly_once` parameter of that side can be set to `True`. Then, after getting a match, an entry is removed from the join state and memory consumption is reduced.
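The wildcard semantics of `path_filter` (`*`, `?`) can be illustrated with Python's standard `fnmatch` module. This is only a sketch of the matching behavior described above; Pathway's actual matching code may differ in detail, and the helper name is invented:

```python
from fnmatch import fnmatch


def apply_path_filter(paths, pattern):
    """Illustrative wildcard post-filtering with * and ? patterns."""
    return [p for p in paths if fnmatch(p, pattern)]
```

For example, `apply_path_filter(["logs/a.csv", "logs/a.tmp"], "*.csv")` keeps only the CSV object even though both paths passed the main `path` filter.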
### Changed

- Delta table compression logging has been improved: logs now include table names, and verbose messages have been streamlined while preserving details of important processing steps.
- Improved initialization speed of `pw.io.s3.read` and `pw.io.minio.read`.
- `pw.io.s3.read` and `pw.io.minio.read` now limit the number and the total size of objects to be predownloaded.
- **BREAKING**: Optimized the implementation of the `pw.reducers.min`, `pw.reducers.max`, `pw.reducers.argmin`, `pw.reducers.argmax`, and `pw.reducers.any` reducers for append-only tables. This is a breaking change for programs using operator persistence: the persisted state will have to be recomputed.
- **BREAKING**: Optimized the implementation of the `pw.reducers.sum` reducer on `float` and `np.ndarray` columns. This is a breaking change for programs using operator persistence: the persisted state will have to be recomputed.
- **BREAKING**: The implementation of data persistence has been optimized for the case of many small objects in the filesystem and S3 connectors. This is a breaking change for programs using data persistence: the persisted state will have to be recomputed.
- **BREAKING**: The data snapshot logic in persistence has been optimized for the case of big input snapshots. This is a breaking change for programs using data persistence: the persisted state will have to be recomputed.
- Improved precision of `pw.reducers.sum` on `float` columns by introducing Neumaier summation.
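The Neumaier summation mentioned in the last entry can be sketched in plain Python (Pathway's actual implementation is internal and will differ). It is a compensated summation that tracks the rounding error lost at each step, which matters when large and small magnitudes are mixed:

```python
def neumaier_sum(values):
    """Compensated (Neumaier) summation sketch for floats."""
    total = 0.0
    compensation = 0.0  # running estimate of lost low-order bits
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            compensation += (total - t) + v  # low-order bits of v were lost
        else:
            compensation += (v - t) + total  # low-order bits of total were lost
        total = t
    return total + compensation
```

On `[1.0, 1e100, 1.0, -1e100]`, naive `sum()` returns `0.0` because both `1.0` terms are absorbed by `1e100`, while `neumaier_sum` recovers the correct `2.0`.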
## v0.25.1

### Added

- `pw.xpacks.llm.mcp_server.PathwayMcp`, which allows serving `pw.xpacks.llm.document_store.DocumentStore` and `pw.xpacks.llm.question_answering` endpoints as MCP (Model Context Protocol) tools.
- `pw.io.dynamodb.write` method for writing to DynamoDB.
## v0.25.0

### Added

- `pw.io.questdb.write` method for writing to QuestDB.
- `pw.io.fs.read` now supports the `"only_metadata"` format. When this format is used, the table contains only metadata updates for the tracked directory, without reading file contents.
### Changed

- **BREAKING**: The Elasticsearch and BigQuery connectors have been moved to the Scale license tier. You can obtain the Scale tier license for free at https://pathway.com/get-license.
- **BREAKING**: `pw.io.fs.read` no longer accepts `format="raw"`. Use `format="binary"` to read binary objects, `format="plaintext_by_file"` to read plaintext objects per file, or `format="plaintext"` to read plaintext objects split into lines.
- **BREAKING**: The `pw.io.s3_csv.read` connector has been removed. Please use `pw.io.s3.read` with `format="csv"` instead.
### Fixed

- `pw.io.s3.read` and `pw.io.s3.write` now also check the `AWS_PROFILE` environment variable for AWS credentials if none are explicitly provided.
## v0.24.1

### Added

- Confluent Schema Registry support in the Kafka and Redpanda input and output connectors.

### Changed

- `pw.io.airbyte.read` now retries the `pip install` command if it fails during the installation of a connector. This only applies when using the PyPI version of the connector, not the Docker one.
## v0.24.0

### Added

- `pw.io.mqtt.read` and `pw.io.mqtt.write` methods for reading from and writing to MQTT.
### Changed

- `pw.xpacks.llm.embedders.SentenceTransformerEmbedder` and `pw.xpacks.llm.llms.HFPipelineChat` are now computed in batches. The maximum size of a single batch can be set in the constructor with the `max_batch_size` argument.
- **BREAKING**: The `api_key` and `base_url` arguments of `pw.xpacks.llm.llms.OpenAIChat` can no longer be set in the `__call__` method; if needed, they should instead be set in the constructor.
- **BREAKING**: The `api_key` argument of `pw.xpacks.llm.llms.OpenAIEmbedder` can no longer be set in the `__call__` method; if needed, it should instead be set in the constructor.
- `pw.io.postgres.write` now accepts arbitrary types for the values of the `postgres_settings` dict. If a value is not a string, Python's `str()` method is used.
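The batching behavior controlled by `max_batch_size` can be sketched in plain Python. This is a hypothetical illustration of grouping inputs into bounded batches before handing them to an embedder or chat model, not the actual xpack implementation, and the helper name is invented:

```python
def make_batches(items, max_batch_size):
    """Group items into consecutive batches of at most max_batch_size."""
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]
```

For example, five documents with `max_batch_size=2` produce three batches: two full ones and a final batch with a single item.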
### Removed

- `pw.io.kafka.read_from_upstash` has been removed, as the managed Kafka service in Upstash has been deprecated.