Add Aryn notebook under integrations #361
Found 1 changed notebook. Review the changes at https://app.gitnotebooks.com/elastic/elasticsearch-labs/pull/361
💚 CLA has been signed
@@ -0,0 +1 @@
This folder contains examples showing how to prepare data using Aryn Sycamore and load into Elasticsearch for RAG and GenAI use cases.
@jonfritz Can you include a link back to the blog once it is published, and also list here any requirements for running this notebook?
I will after it's published.
@jonfritz please make sure to sign our CLA to ensure merging is possible
@jonfritz can you share a link to the blog draft so I can connect the two?
@justincastilla I just shared with you the Blog Draft
Update with pip install
LGTM
@ElishevaStern, some of the checks are failing before this can be merged. Can you or someone from the team please help?
Fly-by comment: I see a reference to Also, I see some references to OpenSearch. Maybe we can still replace those too (might just be a folder name that stayed behind).
@jonfritz can you check the pre-commit jupyter-black messages? It's not liking some of the formatting. Also see xeraa's comment above. Here's a sample of one of the errors, run through Copilot:
The error in the logs is an AssertionError caused by the aryn_api_key being an empty string in the partition_pdf function. Here is the relevant log snippet:
    File "/home/runner/work/elasticsearch-labs/elasticsearch-labs/.venv/lib/python3.10/site-packages/sycamore/transforms/detr_partitioner.py", line 165, in partition_pdf
        assert aryn_api_key != ""
    AssertionError
To fix this error, ensure that the aryn_api_key is properly set and not empty before calling the partition_pdf function. Check the environment variables or configuration files to ensure the API key is correctly provided. You can find the related code in the detr_partitioner.py file, specifically around the partition_pdf function.
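A minimal sketch of the check the log points to, assuming the key is supplied through an environment variable. The helper `require_env` and the variable name `ARYN_API_KEY` are hypothetical; neither is taken from the notebook or from Sycamore.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return value


# Hypothetical usage: resolve the key up front so an empty value fails here,
# with a readable message, instead of inside partition_pdf.
# aryn_api_key = require_env("ARYN_API_KEY")
```

Failing early like this only makes the error clearer; a valid key still has to come from somewhere outside the notebook.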
Update from feedback
Made the updates. The Aryn API Key needs to be set, along with the OpenAI key. What do you suggest we do in the notebook for this, as it does mention it in a comment? I'm not sure what formatting errors you are referring to - can you please elaborate? Thx!
Here's the output from the first failed test:
Run pre-commit/[email protected]
Run python -m pip install pre-commit
Collecting pre-commit
Downloading pre_commit-4.0.1-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting cfgv>=2.0.0 (from pre-commit)
Downloading cfgv-3.4.0-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting identify>=1.0.0 (from pre-commit)
Downloading identify-2.6.5-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting nodeenv>=0.11.1 (from pre-commit)
Downloading nodeenv-1.9.1-py2.py3-none-any.whl.metadata (21 kB)
Collecting pyyaml>=5.1 (from pre-commit)
Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting virtualenv>=20.10.0 (from pre-commit)
Downloading virtualenv-20.28.1-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv>=20.10.0->pre-commit)
Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting filelock<4,>=3.12.2 (from virtualenv>=20.10.0->pre-commit)
Downloading filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
Collecting platformdirs<5,>=3.9.1 (from virtualenv>=20.10.0->pre-commit)
Downloading platformdirs-4.3.6-py3-none-any.whl.metadata (11 kB)
Downloading pre_commit-4.0.1-py2.py3-none-any.whl (218 kB)
Downloading cfgv-3.4.0-py2.py3-none-any.whl (7.2 kB)
Downloading identify-2.6.5-py2.py3-none-any.whl (99 kB)
Downloading nodeenv-1.9.1-py2.py3-none-any.whl (22 kB)
Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 66.9 MB/s eta 0:00:00
Downloading virtualenv-20.28.1-py3-none-any.whl (4.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.3/4.3 MB 115.7 MB/s eta 0:00:00
Downloading distlib-0.3.9-py2.py3-none-any.whl (468 kB)
Downloading filelock-3.16.1-py3-none-any.whl (16 kB)
Downloading platformdirs-4.3.6-py3-none-any.whl (18 kB)
Installing collected packages: distlib, pyyaml, platformdirs, nodeenv, identify, filelock, cfgv, virtualenv, pre-commit
Successfully installed cfgv-3.4.0 distlib-0.3.9 filelock-3.16.1 identify-2.6.5 nodeenv-1.9.1 platformdirs-4.3.6 pre-commit-4.0.1 pyyaml-6.0.2 virtualenv-20.28.1
Run python -m pip freeze --local
cfgv==3.4.0
distlib==0.3.9
filelock==3.16.1
identify==2.6.5
nodeenv==1.9.1
platformdirs==4.3.6
pre_commit==4.0.1
PyYAML==6.0.2
virtualenv==20.28.1
Run actions/cache@v3
Cache Size: ~61 MB (64116572 B)
/usr/bin/tar -xf /home/runner/work/_temp/e535c5b8-f865-4bef-83fa-85180fa331e4/cache.tzst -P -C /home/runner/work/elasticsearch-labs/elasticsearch-labs --use-compress-program unzstd
Cache restored successfully
Cache restored from key: pre-commit-3|/opt/hostedtoolcache/Python/3.12.8/x64|09e2a22a29238ed334baaf5b64c742a0f3883fd9261d0031de09ef33672ea97b
Run pre-commit run --show-diff-on-failure --color=always --all-files
pre-commit run --show-diff-on-failure --color=always --all-files
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.12.8/x64
LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.12.8/x64/lib
ripsecrets...............................................................Passed
black-jupyter............................................................Failed
- hook id: black-jupyter
- exit code: 123
<unknown>:11: SyntaxWarning: invalid escape sequence '\.'
<unknown>:11: SyntaxWarning: invalid escape sequence '\.'
<unknown>:11: SyntaxWarning: invalid escape sequence '\.'
error: cannot format notebooks/integrations/aryn/aryn-elasticsearch-blog-dataprep.ipynb: ('unexpected EOF in multi-line statement', (7, 0))
Oh no! 💥 💔 💥
138 files left unchanged, 1 file failed to reformat.
Error: Process completed with exit code 1.
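The `SyntaxWarning: invalid escape sequence '\.'` lines in the log above typically appear when a regular expression is written in a normal string literal. A small illustration, assuming the notebook contains a pattern along these lines; the actual cell is not shown in the log, so `PDF_SUFFIX` and `is_pdf` are hypothetical.

```python
import re

# A backslash-dot inside a normal string literal triggers the SyntaxWarning
# on Python 3.12. A raw string passes the backslash to the regex engine unchanged.
PDF_SUFFIX = re.compile(r"\.pdf$")  # hypothetical pattern, for illustration only


def is_pdf(path: str) -> bool:
    """Return True when `path` ends with a literal '.pdf'."""
    return PDF_SUFFIX.search(path) is not None
```

This addresses only the warnings. The `unexpected EOF in multi-line statement` error is reported separately and would need the offending cell to be inspected.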
How did you install the pre-commit checks? Could you try this if you haven't already:
|
Hi @jonfritz, if the above command doesn't work, another option is to move the notebook from
Add placeholder API key
Add API key placeholder
I just added a dummy API key, and this should work! Will the tests run automatically?
Signed-off-by: Henry Lindeman <[email protected]>
fix formatting
Just signed the CLA. Can someone rerun that check?
@jonfritz have you tried the two solutions @JessicaGarson suggested?
Those were to fix the formatting from pre-commit, which is now passing.
I think the issue here is that your tests seem to require valid API keys, which I cannot put in the notebook. How do you deal with this for other notebooks, or for things like OpenAI?
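One common pattern for notebooks that need real credentials is to read each key from the environment and fall back to an interactive prompt, so no secret is committed. A sketch under that assumption; `get_secret` and the variable name `OPENAI_API_KEY` are hypothetical, and whether this repository's CI supplies such variables is not stated in this thread.

```python
import os
from getpass import getpass
from typing import Callable


def get_secret(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Read `name` from the environment, prompting interactively only if it is missing."""
    value = os.environ.get(name)
    if value:
        return value
    return prompt(f"Enter {name}: ")


# Hypothetical usage in a notebook cell:
# openai_api_key = get_secret("OPENAI_API_KEY")
```

The `prompt` parameter exists so the function can be exercised without a terminal; in an automated run the environment variable would have to be set for the cell to proceed unattended.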
Add notebook and README for Aryn in the integrations section.