
chatbot-rag-app: recover from timeouts on first use of ELSER #397

Merged: 1 commit merged into main from recover-from-timeout on Feb 21, 2025

Conversation

codefromthecrypt (Collaborator) commented on Feb 19, 2025

Through investigation, I found that the timeout errors on first use were ultimately caused by a buildup of ML jobs, which prevent operation until they complete. This change catches the first error during bulk indexing, which occurs about 10s in. The code then watches for the ML jobs to settle, which is a less frustrating experience than users having to wait several minutes themselves.
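
In rough outline, the recovery flow looks something like the sketch below. This is a simplified illustration, not the exact code: index_documents and the retry structure are stand-ins, though the es.tasks.list call and the warning messages match the trace shown later in this thread.

import time
from warnings import warn
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def count_ml_tasks(es) -> int:
    # ML jobs (e.g. the ELSER model deployment) show up as cluster monitor tasks
    resp = es.tasks.list(detailed=True, actions=["cluster:monitor/xpack/ml/*"])
    return sum(len(node.get("tasks", {})) for node in resp["nodes"].values())

def await_ml_tasks(es, poll_interval=10):
    # poll until the outstanding ML tasks settle to zero
    remaining = count_ml_tasks(es)
    while remaining > 0:
        warn(f"Awaiting {remaining} ML tasks")
        time.sleep(poll_interval)
        remaining = count_ml_tasks(es)

def index_with_retry(index_documents):
    try:
        index_documents()  # first bulk indexing attempt; times out roughly 10s in
    except Exception as e:  # the real code likely catches a narrower connection/timeout error
        warn(f"Error occurred, will retry after ML jobs complete: {e}")
        await_ml_tasks(es)
        index_documents()  # retry once the ML jobs have settled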

The feedback solution I took was the following:

  • Make sure stdout is flushed so people see what's going on
  • Add logs that we started bulk indexing, which explains the >10s initial hang
  • Add a spinner when a tty is usable (a tty is only available with e.g. docker compose run or plain python)
    • not a progress bar, because the tasks fluctuate and sometimes increase before they settle

This makes things better: not only will indexing eventually succeed, there are now several ways to know something is progressing.
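
For illustration, the tty-aware feedback could look roughly like the following sketch. The spinner frames and function name are illustrative, not the exact implementation; the non-tty branch is the behavior that docker compose up ends up showing.

import itertools
import sys
import time

def show_ml_task_progress(count_remaining):
    """Show a spinner on a tty; fall back to flushed plain lines otherwise."""
    interactive = sys.stdout.isatty()
    spinner = itertools.cycle("|/-\\")
    remaining = count_remaining()
    while remaining > 0:
        if interactive:
            # overwrite one line with a spinner and the current task count
            sys.stdout.write(f"\r{next(spinner)} awaiting {remaining} ML tasks ")
        else:
            # no tty (docker compose up, kubernetes): print a plain line per poll
            sys.stdout.write(f"Awaiting {remaining} ML tasks\n")
        sys.stdout.flush()
        time.sleep(1)
        remaining = count_remaining()
    print(("\n" if interactive else "") + "ML tasks complete", flush=True)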

e.g. here, you can see the initial fail and it resolving itself.

[screenshot: the initial failure, followed by automatic recovery]

Those testing from this branch need to remember to build the image first; otherwise it uses the published image, whose code is not from this branch.

e.g. docker compose run --rm --build create-index or docker compose up --force-recreate --build

While I was here, I tuned the docker compose setup slightly. At first I thought there was a memory issue, but there isn't; in fact 2GB is plenty, and it causes less worry considering how many other containers are running. I also matched the health-check behavior with upstream work in kibana, and made it possible to use docker compose run and see the spinner. As well, I renamed the docker compose service to match the python command so it isn't confusingly different. Finally, I updated dependencies just for hygiene.

Fixes #307

codefromthecrypt (Collaborator, Author) commented:

Here's an example run; since it didn't fail, I could get a trace. It took 4.5 minutes to resolve, but without any manual intervention. cc @anuraaga

[+] Running 2/1
 ✔ Container ingest-data   Recreated                                                                                                                                                                                                                                 0.0s 
 ✔ Container api-frontend  Recreated                                                                                                                                                                                                                                 0.0s 
Attaching to api-frontend, ingest-data
ingest-data   | /usr/local/lib/python3.12/site-packages/opentelemetry/instrumentation/elasticsearch/__init__.py:213: Warning: Instrumentation disabled, relying on elasticsearch native OTel support, see https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/elasticsearch/elasticsearch.html
ingest-data   |   warnings.warn(
ingest-data   | /app/data/index_data.py:122: UserWarning: Error occurred, will retry after ML jobs complete: Connection timed out
ingest-data   |   warn(f"Error occurred, will retry after ML jobs complete: {e}")
ingest-data   | /app/data/index_data.py:145: GeneralAvailabilityWarning: This API is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
ingest-data   |   resp = es.tasks.list(detailed=True, actions=["cluster:monitor/xpack/ml/*"])
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 52 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 50 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 48 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 44 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 42 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 40 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 38 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 36 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 34 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 32 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 30 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 28 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 26 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 24 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 22 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 20 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 18 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 16 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 14 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 12 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 10 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 8 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 6 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 4 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | /app/data/index_data.py:154: UserWarning: Awaiting 2 ML tasks
ingest-data   |   warn(f"Awaiting {current_task_count} ML tasks")
ingest-data   | ".elser_model_2" model not available, downloading it now
ingest-data   | ".elser_model_2" model is deployed
ingest-data   | ".elser_model_2" model is ready
ingest-data   | Loading data from $/app/data/data.json
ingest-data   | Loaded 15 documents
ingest-data   | Split 15 documents into 26 chunks
ingest-data   | Creating Elasticsearch sparse vector store in http://localhost:9200
ingest-data exited with code 0

codefromthecrypt (Collaborator, Author) commented:

I updated to use a spinner, as it feels nicer when run with a TTY. The tasks fluctuate, sometimes increasing, so a progress bar is not a good fit.

When run with a tty (instructions updated for docker compose run -T or normal python), you see a spinner:
[screen recording: spinner while awaiting ML tasks]

When using docker compose up (which has no real tty support), and likewise with kubernetes, there's no incremental feedback, but at least there is feedback.

create-index  | Awaiting 4 ML tasks
create-index  | ML tasks complete
create-index  | 
create-index exited with code 0

codefromthecrypt (Collaborator, Author) commented:

OK, I updated to put the spinner (only seen in docker compose run or python) above the dance around ML tasks, as it also helps explain the 10s timeout failure. Finally, I reverted to using docker compose up by default, as it is easier; even though we don't see a spinner, we know what's going on. PTAL

[screenshot of the updated output]

@@ -1,16 +1,17 @@
 name: chatbot-rag-app

 services:
-  ingest-data:
+  create-index:
codefromthecrypt (Collaborator, Author) commented on Feb 20, 2025:

Renamed to match the flask command; having a different name here is confusing, especially with docker compose run vs python.

codefromthecrypt (Collaborator, Author) commented:

Thanks for the review, @anuraaga. I changed things based on all the comments you made, which were sensible of course!

codefromthecrypt (Collaborator, Author) commented on Feb 20, 2025:

So, with @anuraaga's help, I reproduced the very long first use of ELSER in CI. This test only kicks off when we touch the docker-related files, so that it doesn't delay routine PRs.

Basically, x86 settles in under a minute, whereas arm64 takes over 13 minutes. My Apple M3 Pro took 4.5 minutes.

codefromthecrypt (Collaborator, Author) commented:

@anuraaga, as we are blocked on review, I will squash this PR so that I can more easily integrate the k8s branch on top of it.

codefromthecrypt merged commit 2580c00 into main on Feb 21, 2025. 5 checks passed.
codefromthecrypt deleted the recover-from-timeout branch on February 21, 2025 at 12:13.
codefromthecrypt (Collaborator, Author) commented:

Thanks for the help, @anuraaga, and for the review, @joemcelroy!


Successfully merging this pull request may close these issues:

  • chatbot-rag-app: timeout errors installing elser and/or while it is initializing