What's Changed
- [agentic-exploration branch] Minor updates to dpk_intro_1_langchain notebook. by @revit13 in #942
- Starting new release cycle after cutoff 1.0.0 by @matouma in #968
- Updating the semantic rules in csv file by @pankajskku in #963
- [KFP] Obtain the Ray cluster run ID from the user for KFP v2. by @revit13 in #956
- Ordered in reverse chronology and added Dates for events by @agoyal26 in #976
- add exception handling in mkdocs hook by @shivdeep-singh-ibm in #984
- added quick patch disabling fcntl for Windows by @matouma in #987
- Updating rag-html-1 example by @sujee in #949
- update maintainers by @touma-I in #986
- designate folder for all data-files used by various examples and tutorial by @matouma in #994
- added Optional step for enabling kfp by @touma-I in #992
- Add extreme tokenize and readability transforms by @cmadam in #965
- Making column names lowercase to make output table schema compatible with the Lakehouse by @pankajskku in #979
- Documentation adjustments by @cmadam in #999
- README files for supporting native windows by @shahrokhDaijavad in #991
- gneissweb_classification by @ran-iwamoto in #974
- DPK processing of text data for finetuning by @PoojaHolkar in #973
- Fix some typos in contribute-your-own-transform.md by @shahrokhDaijavad in #1004
- Rep removal by @swith005 in #953
- Dev 1.0.1.dev1 by @matouma in #1006
- Fdedup package versioning and windows fixes by @cmadam in #1003
- Testing dev1 release by @matouma in #1014
- Reorganized landing page readme and added readme to examples folder by @agoyal26 in #1001
- Update contribute-your-own-transform.md by @shahrokhDaijavad in #1019
- pdf-processing-1 example updated by @sujee in #998
- Updating URLs to point to main data prep kit repo by @sujee in #1022
- Upgrade Docling to v2.21 by @dolfim-ibm in #1031
- Cargo fix by @swith005 in #1016
- Readability transform: performance improvement and adding score_list argument by @cmadam in #1026
- added writeup for building dev wheel by @matouma in #1025
- DPK LLM Agent by @Mohammad-nassar10 in #1021
- Rag pdf 2 by @sujee in #955
- change data files location to 'examples/data-files/pdf-processing-1' by @sujee in #1036
- Updates the doc to show how to pip install and run a transform at the CLI by @daw3rd in #928
- KFP v2: Fix wrong Ray cluster name by @revit13 in #1039
- Extreme Tokenize transform fails when the number of documents is not equal to the number of token sets by @cmadam in #1053
- GneissWeb_recipe_notebook by @Hajar-Emami in #1055
- Fixed Readme git website by @agoyal26 in #1049
- Add shorter alternative flags to options in execute_ray_job_multi_s3.py. by @revit13 in #1067
- Update super pipeline kfp v2. by @revit13 in #1066
- Update transform.py by @ian-cho in #1056
- Add Supported Languages Table to Lang_id transform by @shahrokhDaijavad in #1068
- test pr target by @matouma in #1075
- test using env variable by @matouma in #1076
- added pull request target to code quality and gneissweb by @matouma in #1077
- Update main README.md to fix two broken links in the table by @shahrokhDaijavad in #1074
- Fix PDF with RAG url in readme by @dpkshetty in #1062
- updated embedding model and LLM for rag-pdf-1 example by @sujee in #1060
- trigger on pull request by @matouma in #1082
- use None value rather than None string by @matouma in #1083
- Change gneissweb classification workflow to use PR Target by @matouma in #1095
- Update transform.py by @ian-cho in #1058
- Clear the notebook of the run details by @Hajar-Emami in #1071
- Share secret securely across fork by @matouma in #1084
- Enabling gneissweb_classification transform by using multiple fasttext classifiers simultaneously by @ran-iwamoto in #1046
- Fdedup::transform() return 0 for success or error code by @cmadam in #1041
- Avoid exposing Hugging Face token in lang-id kfp pipeline by @revit13 in #1099
- Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files by @santoshborse in #1033
- Update docling to 2.25 and enable XML/JATS by @dolfim-ibm in #1108
- Implementing Bloom Annotator by @ian-cho in #978
- GneissWeb Notebook that uses dev2 by @shahrokhDaijavad in #1103
- Dev3 testing by @matouma in #1111
- relax requirements for boto3 by @matouma in #1018
- toolkit release 0.2.4 and transforms release 1.1.0 by @matouma in #1115
New Contributors
- @ran-iwamoto made their first contribution in #974
- @swith005 made their first contribution in #953
- @Hajar-Emami made their first contribution in #1055
- @dpkshetty made their first contribution in #1062
Full Changelog: v1.0.0...v1.1.0