chatbot-rag-app: recover from timeouts on first use of ELSER #397
Conversation
Here's an example run; since it didn't fail, I could get a trace. It took 4.5 minutes to resolve, but without any manual intervention. cc @anuraaga
```diff
@@ -1,16 +1,17 @@
 name: chatbot-rag-app

 services:
-  ingest-data:
+  create-index:
```
renamed to match the flask command, as having a different name here is confusing, especially with `docker compose run` vs `python`
thanks for the review, @anuraaga! I changed based on all the comments you made, which were sensible of course!
So, with @anuraaga's help, I reproduced the very long first use of ELSER in CI. This test only kicks off when we affect the docker-related files, so it doesn't delay routine PRs. Basically, x86 settles in under a minute, whereas arm64 takes over 13 minutes. My Apple M3 Pro took 4.5 minutes.
@anuraaga, as we are blocked on review, I will squash this PR so that I can more easily integrate the k8s branch over it.
thanks for the help @anuraaga and the review @joemcelroy!
Through investigation, I found that timeout errors on first use were ultimately caused by a buildup of ML jobs, which prevent operation until they complete. This change catches the first error during bulk indexing, which happens about 10s in. Then, the code watches for the ML jobs to settle, which is a less frustrating experience than users having to do so themselves for several minutes.
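Here's a minimal sketch of that recovery idea (not the exact code in this PR), assuming the elasticsearch-py client; the model id, timeout, and the specific stats fields checked are my assumptions:

```python
import time

from elasticsearch import ApiError, Elasticsearch
from elasticsearch.helpers import BulkIndexError, bulk

ELSER_MODEL_ID = ".elser_model_2"  # assumption: whichever ELSER model the app deploys


def wait_for_elser(client: Elasticsearch, timeout: float = 20 * 60) -> None:
    """Poll ML trained model stats until the ELSER deployment is fully allocated."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        stats = client.ml.get_trained_models_stats(model_id=ELSER_MODEL_ID).body
        for model in stats.get("trained_model_stats", []):
            allocation = model.get("deployment_stats", {}).get("allocation_status", {})
            if allocation.get("state") == "fully_allocated":
                return
        print("ELSER is still starting, retrying in 10 seconds...")
        time.sleep(10)
    raise TimeoutError(f"{ELSER_MODEL_ID} did not allocate within {timeout}s")


def index_with_recovery(client: Elasticsearch, actions: list[dict]) -> None:
    """Attempt bulk indexing once; on the first failure, wait for ELSER, then retry."""
    try:
        bulk(client, actions)
    except (ApiError, BulkIndexError):
        # The first attempt typically fails ~10s in while ML jobs pile up, so
        # instead of giving up, wait for the deployment to settle and retry.
        wait_for_elser(client)
        bulk(client, actions)
```

The point is that the waiting happens in code, with progress messages, rather than the user retrying by hand.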
The feedback solution I took works regardless of how you run it (`docker compose run` or `python`). This makes it better because not only will this succeed eventually, there are many ways to know something is progressing. E.g. here, you can see the initial failure and it resolving itself.
Those testing from this branch need to remember to build the image first, as otherwise it uses the published image, whose code is not from this branch. E.g. `docker compose run --rm --build create-index` or `docker compose up --force-recreate --build`.
While I was here, I tuned the docker compose setup slightly. At first, I thought there was a memory issue, but there isn't: 2GB is in fact plenty, and it causes less worry considering many other containers are running. I also matched health check behavior with upstream work in Kibana, and made it possible to use `docker compose run` and see the spinner. As well, I renamed the docker compose service to match the python command so it isn't confusingly different. Finally, I updated dependencies just for hygiene.

Fixes #307
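For the compose tuning, here's a hypothetical sketch of the kind of settings described; the service names, image tag, memory limit, and health check command are assumptions, not a copy of the file changed in this PR:

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    mem_limit: 2g  # 2GB is plenty and leaves headroom for the other containers
    healthcheck:
      # block dependent services until the cluster responds and isn't red
      test: ["CMD-SHELL", "curl -s http://localhost:9200/_cluster/health | grep -vq '\"status\":\"red\"'"]
      interval: 10s
      retries: 60

  create-index:  # renamed from ingest-data to match the python command
    depends_on:
      elasticsearch:
        condition: service_healthy
```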