
chatbot-rag-app: recover from timeouts on first use of ELSER #397

Merged: 1 commit merged into main from recover-from-timeout on Feb 21, 2025

Conversation

codefromthecrypt (Collaborator) commented on Feb 19, 2025

Through investigation, I found that the timeout errors on first use were ultimately caused by a buildup of ML jobs, which prevent operation until they complete. This change catches the first error during bulk indexing, which occurs about 10s in. The code then watches for the ML jobs to settle, which is a less frustrating experience than users having to wait several minutes themselves.
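
In rough outline, the recovery flow looks something like the sketch below. This is a simplified illustration, not the exact code: index_documents and the retry structure are stand-ins, though the es.tasks.list call and the warning messages match the trace shown later in this thread.

import time
from warnings import warn
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def count_ml_tasks(es) -> int:
    # ML jobs (e.g. the ELSER model deployment) show up as cluster monitor tasks
    resp = es.tasks.list(detailed=True, actions=["cluster:monitor/xpack/ml/*"])
    return sum(len(node.get("tasks", {})) for node in resp["nodes"].values())

def await_ml_tasks(es, poll_interval=10):
    # poll until the outstanding ML tasks settle to zero
    remaining = count_ml_tasks(es)
    while remaining > 0:
        warn(f"Awaiting {remaining} ML tasks")
        time.sleep(poll_interval)
        remaining = count_ml_tasks(es)

def index_with_retry(index_documents):
    try:
        index_documents()  # first bulk indexing attempt; times out roughly 10s in
    except Exception as e:  # the real code likely catches a narrower connection/timeout error
        warn(f"Error occurred, will retry after ML jobs complete: {e}")
        await_ml_tasks(es)
        index_documents()  # retry once the ML jobs have settled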

The feedback solution I took was the following:

  • Make sure stdout is flushed so people see what's going on
  • Add logs that we started bulk indexing, which explains the >10s initial hang
  • Add a spinner when a tty is usable (a tty is only available with e.g. docker compose run or plain python)
    • not a progress bar, because the tasks fluctuate and sometimes increase before they settle

This makes things better: not only will indexing eventually succeed, there are now several ways to know something is progressing.
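
For illustration, the tty-aware feedback could look roughly like the following sketch. The spinner frames and function name are illustrative, not the exact implementation; the non-tty branch is the behavior that docker compose up ends up showing.

import itertools
import sys
import time

def show_ml_task_progress(count_remaining):
    """Show a spinner on a tty; fall back to flushed plain lines otherwise."""
    interactive = sys.stdout.isatty()
    spinner = itertools.cycle("|/-\\")
    remaining = count_remaining()
    while remaining > 0:
        if interactive:
            # overwrite one line with a spinner and the current task count
            sys.stdout.write(f"\r{next(spinner)} awaiting {remaining} ML tasks ")
        else:
            # no tty (docker compose up, kubernetes): print a plain line per poll
            sys.stdout.write(f"Awaiting {remaining} ML tasks\n")
        sys.stdout.flush()
        time.sleep(1)
        remaining = count_remaining()
    print(("\n" if interactive else "") + "ML tasks complete", flush=True)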

e.g. here, you can see the initial fail and it resolving itself.

[screenshot: the initial failure, followed by automatic recovery]

Those testing from this branch need to remember to build the image first; otherwise it uses the published image, whose code is not from this branch.

e.g. docker compose run --rm --build create-index or docker compose up --force-recreate --build

While I was here, I tuned the docker compose setup slightly. At first I thought there was a memory issue, but there isn't; in fact 2GB is plenty, and it causes less worry considering how many other containers are running. I also matched the health-check behavior with upstream work in kibana, and made it possible to use docker compose run and see the spinner. As well, I renamed the docker compose service to match the python command so it isn't confusingly different. Finally, I updated dependencies just for hygiene.

Fixes #307

codefromthecrypt (Collaborator, Author) commented:

Here's an example run; since it didn't fail, I could get a trace. It took 4.5 minutes to resolve, but without any manual intervention. cc @anuraaga

[+] Running 2/1
 ✔ Container ingest-data   Recreated                                                                                                                                                                                                                                 0.0s 
 ✔ Container api-frontend  Recreated                                                                                                                                                                                                                                 0.0s 
Attaching to api-frontend, ingest-data
ingest-data   | /usr/local/lib/python3.12/site-packages/opentelemetry/instrumentation/elasticsearch/__init__.py:213: Warning: Instrumentation disabled, relying on elasticsearch native OTel support, see https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/elasticsearch/elasticsearch.html
ingest-data   |   warnings.warn(
ingest-data   | /app/data/index_data.py:122: UserWarning: Error occurred, will retry after ML jobs complete: Connection timed out
ingest-data   |   warn(f"Error occurred, will retry after ML jobs complete: {e}")
ingest-data   | /app/data/index_data.py:145: GeneralAvailabilityWarning: This API is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
ingest-data   |   resp = es.tasks.list(detailed=True, actions=["cluster:monitor/xpack/ml/*"])
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 52 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 50 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 48 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 44 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 42 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 40 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 38 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 36 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 34 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 32 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 30 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 28 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 26 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 24 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 22 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 20 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 18 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 16 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 14 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 12 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 10 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 8 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 6 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 4 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 2 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | ".elser_model_2" model not available, downloading it now
ingest-data   | ".elser_model_2" model is deployed
ingest-data   | ".elser_model_2" model is ready
ingest-data   | Loading data from $/app/data/data.json
ingest-data   | Loaded 15 documents
ingest-data   | Split 15 documents into 26 chunks
ingest-data   | Creating Elasticsearch sparse vector store in http://localhost:9200
ingest-data exited with code 0

codefromthecrypt (Collaborator, Author) commented:

I updated to use a spinner, as it feels nicer when run with a TTY. The tasks fluctuate, sometimes increasing, so a progress bar is not a good fit.

When run with a tty (instructions updated for docker compose run -T or normal python), you see a spinner:
[screen recording: spinner while awaiting ML tasks]

When using docker compose up (which has no real tty support), and likewise with kubernetes, there's no incremental feedback, but at least there is feedback.

create-index  | Awaiting 4 ML tasks
create-index  | ML tasks complete
create-index  | 
create-index exited with code 0

codefromthecrypt (Collaborator, Author) commented:

OK, I updated to put the spinner (only seen in docker compose run or python) above the dance around ML tasks, as it also helps explain the 10s timeout failure. Finally, I reverted to using docker compose up by default, as it is easier; even though we don't see a spinner, we know what's going on. PTAL

[screenshot of the updated output]

@@ -1,16 +1,17 @@
 name: chatbot-rag-app

 services:
-  ingest-data:
+  create-index:
codefromthecrypt (Collaborator, Author) commented on Feb 20, 2025:

Renamed to match the flask command; having a different name here is confusing, especially with docker compose run vs python.

codefromthecrypt (Collaborator, Author) commented:

Thanks for the review, @anuraaga. I changed things based on all the comments you made, which were sensible of course!

codefromthecrypt (Collaborator, Author) commented on Feb 20, 2025:

So, with @anuraaga's help, I reproduced the very long first use of ELSER in CI. This test only kicks off when we touch the docker-related files, so that it doesn't delay routine PRs.

Basically, x86 settles in under a minute, whereas arm64 takes over 13 minutes. My Apple M3 Pro took 4.5 minutes.

codefromthecrypt (Collaborator, Author) commented:

@anuraaga, as we are blocked on review, I will squash this PR so that I can more easily integrate the k8s branch on top of it.

codefromthecrypt merged commit 2580c00 into main on Feb 21, 2025. 5 checks passed.
codefromthecrypt deleted the recover-from-timeout branch on February 21, 2025 at 12:13.
codefromthecrypt (Collaborator, Author) commented:

Thanks for the help, @anuraaga, and for the review, @joemcelroy!


Successfully merging this pull request may close these issues:

  • chatbot-rag-app: timeout errors installing elser and/or while it is initializing