Releases: pathwaycom/pathway
v0.20.0
[0.20.0] - 2025-02-25
Added
- Added structure-aware chunking for DoclingParser.
- Added table_parsing_strategy for DoclingParser.
- Column expressions as_int(), as_float(), as_str(), and as_bool() now accept additional arguments, unwrap and default, to simplify null handling.
- Support for Python tuples in expressions.
Changed
- BREAKING: Changed the argument in DoclingParser from parse_images (bool) into image_parsing_strategy (Literal["llm"] | None).
- BREAKING: The doc_post_processors argument in pw.xpacks.llm.document_store.DocumentStore no longer accepts pw.UDF.
- Better error messages when using pathway spawn with multiple workers. Error messages are now printed only from the worker that directly experiences the error.
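Code written against earlier releases can absorb the DoclingParser argument change above with a small keyword-argument shim. The sketch below is plain Python and does not import Pathway. The helper name migrate_docling_kwargs is hypothetical, and mapping parse_images=True to image_parsing_strategy="llm" is an assumption based on the new argument's type, not a documented equivalence.

```python
from typing import Any


def migrate_docling_kwargs(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of kwargs with parse_images replaced by image_parsing_strategy."""
    migrated = dict(kwargs)  # leave the caller's dict untouched
    if "parse_images" in migrated:
        parse_images = migrated.pop("parse_images")
        # Assumption: True maps to "llm" and False maps to None.
        # An explicitly passed image_parsing_strategy is kept as-is.
        migrated.setdefault("image_parsing_strategy", "llm" if parse_images else None)
    return migrated
```

The migrated dictionary could then be passed on as keyword arguments when constructing the parser.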
Fixed
- doc_post_processors argument in pw.xpacks.llm.document_store.DocumentStore had no effect. This is now fixed.
v0.19.0
Added
- LLMReranker now supports custom prompts as well as custom response parsers, allowing for ranking scales other than the default 1-5.
- pw.io.kafka.write and pw.io.nats.write now support ColumnReference as a topic name. When a ColumnReference is provided, each message's topic is determined by the corresponding column value.
- pw.io.python.write accepting ConnectorObserver as an alternative to pw.io.subscribe.
- pw.io.iceberg.read and pw.io.iceberg.write now support S3 as data backend and AWS Glue catalog implementations.
- All output connectors now support the sort_by field for ordering output within a single minibatch.
- A new UDF executor pw.udfs.fully_async_executor. It allows for creation of non-blocking asynchronous UDFs whose results can be returned at a future processing time.
- A Future data type to represent results of fully asynchronous UDFs.
- pw.Table.await_futures method to wait for results of fully asynchronous UDFs.
- pw.io.deltalake.write now supports partition columns specification.
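The custom response parsers mentioned for LLMReranker can be pictured as a function that turns the model's text reply into a numeric score. Below is a minimal sketch for a 1-10 scale. The function name, the 1-10 range, and the str-to-float signature are assumptions made for illustration and should be checked against the LLMReranker documentation before use.

```python
import re


def parse_score_1_to_10(response: str) -> float:
    """Extract the first integer in the reply and check that it lies within 1-10."""
    match = re.search(r"\d+", response)
    if match is None:
        raise ValueError(f"no score found in response: {response!r}")
    score = int(match.group())
    if not 1 <= score <= 10:
        raise ValueError(f"score {score} is outside the 1-10 scale")
    return float(score)
```

Raising on an unparsable or out-of-range reply keeps malformed model output from silently becoming a ranking score.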
Changed
- BREAKING: Changed the interface of LLMReranker: the use_logit_bias, cache_strategy, retry_strategy and kwargs arguments are no longer supported.
- BREAKING: LLMReranker no longer inherits from pw.UDF.
- BREAKING: pw.stdlib.utils.AsyncTransformer.output_table now returns a table with columns with Future data type.
- pw.io.deltalake.read can now read append-only tables without requiring explicit specification of primary key fields.
v0.18.0
Added
- pw.io.postgres.write and pw.io.postgres.write_snapshot now handle serialization of PyObjectWrapper and Timedelta properly.
- New chunking options in pathway.xpacks.llm.parsers.UnstructuredParser.
- Now all Pathway types can be serialized into JSON and consistently deserialized back.
- table.col.dt.to_duration converting an integer into a pw.Duration.
- pw.Json now supports storing datetime and duration type values in ISO format.
Changed
- BREAKING: Changed the interface of UnstructuredParser.
- BREAKING: The Pointer type is now serialized and deserialized as a string field in Iceberg and Delta Lake.
- BREAKING: The Bytes type is now serialized and deserialized with base64 encoding and decoding when the JSON format is used. A string field is used to store the encoded contents.
- BREAKING: The Array type is now serialized and deserialized as an object with two fields: shape denoting the shape of the stored multi-dimensional array and elements denoting the elements of the flattened array.
- BREAKING: Marked package as py.typed to indicate support for type hints.
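The JSON representations of Bytes and Array described above can be illustrated with the standard library alone. This is a sketch of the described layout, not Pathway's serializer. The field names shape and elements come from the notes above, while the helper names, the restriction to two-dimensional arrays, and the row-major flattening order are assumptions.

```python
import base64
import json


def encode_bytes(value: bytes) -> str:
    """Store raw bytes as a base64 string field."""
    return base64.b64encode(value).decode("ascii")


def decode_bytes(value: str) -> bytes:
    return base64.b64decode(value)


def encode_array(rows: list[list[float]]) -> dict:
    """Encode a rectangular 2-D list as an object with shape and elements."""
    n_cols = len(rows[0]) if rows else 0
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    # Assumption: elements are flattened in row-major order.
    return {"shape": [len(rows), n_cols], "elements": [x for row in rows for x in row]}


def decode_array(obj: dict) -> list[list[float]]:
    n_rows, n_cols = obj["shape"]
    elements = obj["elements"]
    return [elements[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


record = {
    "payload": encode_bytes(b"\x00\x01hi"),
    "matrix": encode_array([[1, 2, 3], [4, 5, 6]]),
}
text = json.dumps(record)      # what a JSON sink would carry
restored = json.loads(text)    # what a JSON reader would see
```

Consumers that previously read these fields in another form need to decode the base64 string and rebuild the array from shape and elements.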
Removed
- BREAKING: Removed undocumented license_key argument from pw.run and pw.run_all methods. Instead, pw.set_license_key should be used.
v0.17.0
Added
- pw.io.iceberg.read method for reading Apache Iceberg tables into Pathway.
- Methods pw.io.postgres.write and pw.io.postgres.write_snapshot now accept an additional argument init_mode, which allows initializing the table before writing.
- pw.io.deltalake.read now supports serialization and deserialization for all Pathway data types.
- New parser pathway.xpacks.llm.parsers.DoclingParser supporting parsing of PDFs with tables and images.
- Output connectors now include an optional name parameter. If provided, this name will appear in logs and monitoring dashboards.
- Automatic naming for input and output connectors has been enhanced.
Changed
- BREAKING: pw.io.deltalake.read now requires explicit specification of primary key fields.
- BREAKING: pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer now returns a dictionary from the pw_ai_answer endpoint.
- pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer allows optionally returning context documents from the pw_ai_answer endpoint.
- BREAKING: When using delay in temporal behavior, current time is updated immediately, not in the next batch.
- BREAKING: The Pointer type is now serialized to Delta Tables as raw bytes.
- pw.io.kafka.write now allows specifying key and headers for JSON and CSV data formats.
- The persistent_id parameter in connectors has been renamed to name. This new name parameter allows you to assign names to connectors, which will appear in logs and monitoring dashboards.
- Changed names of parsers to be more consistent: ParseUnstructured -> UnstructuredParser, ParseUtf8 -> Utf8Parser. ParseUnstructured and ParseUtf8 are now deprecated.
Fixed
- generate_class method in Schema now correctly renders columns of UnionType and None types.
- a bug in delay in temporal behavior. It was possible to emit a single entry twice in a specific situation.
- pw.io.postgres.write_snapshot now correctly handles tables that only have primary key columns.
Removed
- BREAKING: pw.indexing.build_sorted_index, pw.indexing.retrieve_prev_next_values, pw.indexing.sort_from_index and pw.indexing.SortedIndex are removed. Sorting is now done with pw.Table.sort.
- BREAKING: Removed deprecated methods pw.Table.unsafe_promise_same_universe_as, pw.Table.unsafe_promise_universes_are_pairwise_disjoint, pw.Table.unsafe_promise_universe_is_subset_of, pw.Table.left_join, pw.Table.right_join, pw.Table.outer_join, pw.stdlib.utils.AsyncTransformer.result.
- BREAKING: Removed deprecated column _pw_shard in the result of windowby.
- BREAKING: Removed deprecated functions pw.debug.parse_to_table, pw.udf_async, pw.reducers.npsum, pw.reducers.int_sum, pw.stdlib.utils.col.flatten_column.
- BREAKING: Removed deprecated module pw.asynchronous.
- BREAKING: Removed deprecated access to functions from pw.io in pw.
- BREAKING: Removed deprecated classes pw.UDFSync, pw.UDFAsync.
- BREAKING: Removed class pw.xpacks.llm.parsers.OpenParse. Its functionality has been replaced with pw.xpacks.llm.parsers.DoclingParser.
- BREAKING: Removed deprecated arguments from input connectors: value_columns, primary_key, types, default_values. Schema should be used instead.
v0.16.4
Fixed
- Google Drive connector in static mode now displays correctly in Jupyter visualizations.
v0.16.3
Added
- pw.io.iceberg.write method for writing Pathway tables into Apache Iceberg.
Changed
- values of non-deterministic UDFs are not stored in tables that are append_only.
- pw.Table.ix has a better runtime error message that includes the id of the missing row.
Fixed
- temporal behaviors in temporal operators (windowby, interval_join) now consume no CPU when no data passes through them.
v0.16.2
Added
- pw.xpacks.llm.prompts.RAGPromptTemplate, set of prompt utilities that enable verifying templates and creating UDFs from prompt strings or callables.
- pw.xpacks.llm.question_answering.BaseContextProcessor streamlines development and tuning of representing retrieved context documents to the LLM.
- pw.io.kafka.read now supports with_metadata flag, which makes it possible to attach the metadata of the Kafka messages to the table entries.
- pw.io.deltalake.read can now stream the tables with deletions, if no deletion vectors were used.
Changed
- pw.io.sharepoint.read now explicitly terminates with an error if it fails to read the data the specified number of times per row (the default is 8).
- pw.xpacks.llm.prompts.prompt_qa, and other prompts expect 'context' and 'query' fields instead of 'docs'.
- Removed support for short_prompt_template and long_prompt_template in pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer. These prompt variants are no longer accepted during construction or in requests.
- pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer allows setting user created prompts. Templates are verified to include 'context' and 'query' placeholders.
- pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer can take a BaseContextProcessor that represents context documents to the LLM. Defaults to pw.xpacks.llm.question_answering.SimpleContextProcessor which filters metadata fields and joins the documents with new lines.
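The placeholder check described above for user-created prompts can be approximated outside Pathway, for example to validate templates in tests before deployment. The sketch below uses only the standard library and assumes str.format-style placeholders such as {context} and {query}. The helper names are hypothetical and this is not Pathway's own verification code.

```python
from string import Formatter

REQUIRED_PLACEHOLDERS = {"context", "query"}


def missing_placeholders(template: str) -> set[str]:
    """Return the required placeholder names that the template does not use."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    found = {field for _, field, _, _ in Formatter().parse(template) if field}
    return REQUIRED_PLACEHOLDERS - found


def validate_template(template: str) -> str:
    """Raise ValueError if a required placeholder is absent, else return the template."""
    missing = missing_placeholders(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")
    return template
```

A template that still uses the older 'docs' field fails this check because it lacks 'context', which is the migration case the notes above describe.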
Fixed
- The input of pw.io.fs.read and pw.io.s3.read is now correctly persisted in case deletions or modifications of already processed objects take place.
v0.16.1
Changed
- pw.io.s3.read now monitors object deletions and modifications in the S3 source when run in streaming mode. When an object is deleted in S3, it is also removed from the engine. Similarly, if an object is modified in S3, the engine updates its state to reflect those changes.
- pw.io.s3.read now supports with_metadata flag, which makes it possible to attach the metadata of the source object to the table entries.
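The deletion and modification tracking described above amounts to comparing successive views of the source. The following sketch shows that idea on two listings that map object keys to a version tag, such as an ETag. It is a conceptual illustration only: the function name and the listing format are assumptions, and it does not describe how the Pathway engine implements monitoring.

```python
def diff_listings(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {object_key: version_tag} listings of the same source."""
    added = sorted(key for key in current if key not in previous)
    deleted = sorted(key for key in previous if key not in current)
    modified = sorted(
        key for key in current if key in previous and current[key] != previous[key]
    )
    return {"added": added, "deleted": deleted, "modified": modified}
```

In this picture, a deleted key leads to the corresponding rows being removed, and a modified key leads to their replacement with the new contents.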
Fixed
- pw.xpacks.llm.document_store.DocumentStore no longer requires _metadata column in the input table.
v0.16.0
Changelog
All notable changes to this project will be documented in this file.
This project adheres to Semantic Versioning.
[0.16.0] - 2024-11-29
Added
- pw.xpacks.llm.document_store.SlidesDocumentStore, which is a subclass of pw.xpacks.llm.document_store.DocumentStore customized for retrieving slides from presentations.
- pw.temporal.inactivity_detection and pw.temporal.utc_now functions allowing for alerting and other time-dependent use cases.
Changed
- pw.Table.concat, pw.Table.with_id, pw.Table.with_id_from no longer perform checks if ids are unique. It improves memory usage.
- table operations that store values (like pw.Table.join, pw.Table.update_cells) no longer store columns that are not used downstream.
- append_only column property is now propagated better (there are more places where we can infer it).
- BREAKING: Unused arguments from the constructor pw.xpacks.llm.question_answering.DeckRetriever are no longer accepted.
Fixed
- query_as_of_now of pw.stdlib.indexing.DataIndex and pw.stdlib.indexing.HybridIndex now work in constant memory for infinite query stream (no query-related data is kept after query is answered).
v0.15.4
Added
- pw.io.kafka.read now supports reading entries starting from a specified timestamp.
- pw.io.nats.read and pw.io.nats.write methods for reading from and writing Pathway tables to NATS.
Changed
- pw.Table.diff now supports setting the instance parameter, which allows computing differences for multiple groups.
- pw.io.postgres.write_snapshot now keeps the Postgres table fully in sync with the current state of the table in Pathway. This means that if an entry is deleted in Pathway, the same entry will also be deleted from the Postgres table managed by the output connector.
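What computing differences for multiple groups means for pw.Table.diff can be shown on plain tuples. In the sketch below each row carries an instance, a timestamp, and a value, and each value is compared only with the previous value of the same instance. The function name and the row layout are assumptions for illustration, and the code does not call Pathway.

```python
from collections import defaultdict


def diff_by_instance(rows):
    """rows: iterable of (instance, timestamp, value) tuples.

    Returns (instance, timestamp, delta) tuples ordered by instance and
    timestamp. delta is None for the first row of each instance.
    """
    groups = defaultdict(list)
    for instance, timestamp, value in rows:
        groups[instance].append((timestamp, value))

    result = []
    for instance in sorted(groups):
        previous = None
        for timestamp, value in sorted(groups[instance]):
            delta = None if previous is None else value - previous
            result.append((instance, timestamp, delta))
            previous = value
    return result
```

Without the grouping step, rows of different instances would be interleaved by timestamp and subtracted from each other, which is the situation the instance parameter avoids.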
Fixed
- pw.PyObjectWrapper is now picklable.