test: add vectorizer benchmark tools #570

Open · wants to merge 1 commit into main
Conversation

alejandrodnm (Contributor) commented Mar 18, 2025

New benchmark `just` recipes for the vectorizer:

- `just pgai benchmark-mem`: runs the vectorizer on a queue of 500
  items, then generates a memray profile and a memory usage flamegraph.

- `just pgai benchmark-cpu`: runs the vectorizer on a queue of 500
  items, then generates a CPU usage flamegraph using py-spy.

- `just pgai benchmark-cpu-top`: runs the vectorizer on a queue of 500
  items, displaying a top-like interface of CPU usage.

- `just pgai benchmark-queue-count`: shows the queue count of the
  running benchmark (a sketch of what this amounts to follows the
  list). It should be executed in a separate terminal; the count is
  refreshed at a regular interval.
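
For illustration only, here is a minimal sketch of what a queue-count watcher like the last recipe could boil down to. This is not the PR's code: the connection string and the queue table name (pgai generates one per vectorizer) are assumptions.

```python
# Hypothetical benchmark-queue-count style watcher: poll the vectorizer
# queue table and print the remaining item count at a fixed interval.
import time

import psycopg

DB_URL = "postgres://postgres@localhost:5432/benchmark"  # assumed connection string


def watch_queue(interval_s: float = 2.0) -> None:
    with psycopg.connect(DB_URL) as conn:
        while True:
            # Queue table name is an assumption; pgai creates one per vectorizer.
            (count,) = conn.execute(
                "SELECT count(*) FROM ai._vectorizer_q_1"
            ).fetchone()
            print(f"queue count: {count}")
            if count == 0:
                break
            time.sleep(interval_s)


if __name__ == "__main__":
    watch_queue()
```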

The benchmark DB is loaded from the `wiki.dump` stored in the repository. The recipes create an OpenAI vectorizer and run the benchmark tool, storing the results in `projects/pgai/benchmark/results`.

There's a new CLI command, `pgai vectorizer worker-benchmark`. It's the same as the regular worker, but it's wrapped in a VCR client that replays recorded OpenAI requests. The CPU benchmarks use this command so that request/response times to the API stay constant, which makes the timing results more deterministic. The cassette file is tracked with git-lfs.
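
Conceptually, the VCR wrapping amounts to running the worker's API traffic inside a replay-only cassette. A minimal sketch with vcrpy follows; the cassette path and the embedding call are illustrative assumptions, not this PR's actual wiring, and `record_mode="none"` makes VCR serve recorded responses only and raise on anything unrecorded.

```python
# Sketch of replay-only VCR usage; not the PR's actual worker-benchmark code.
import vcr
from openai import OpenAI

# "none" = replay from the cassette only; unrecorded requests raise an error.
REPLAY = vcr.VCR(record_mode="none")


def embed_batch(texts: list[str]) -> None:
    # Dummy key: requests are served from the cassette and never hit the API.
    client = OpenAI(api_key="test-key")
    client.embeddings.create(model="text-embedding-3-small", input=texts)


if __name__ == "__main__":
    # Constant, local response times make CPU profiles far more reproducible.
    with REPLAY.use_cassette("projects/pgai/benchmark/cassette.yaml"):  # assumed path
        embed_batch(["first chunk", "second chunk"])
```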

The memory benchmarks don't use the VCR-wrapped command, because VCR itself uses considerable memory to handle the API calls and would pollute the memory profile.
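
For reference, memray can capture a profile from the CLI (`memray run` followed by `memray flamegraph`) or programmatically. The sketch below shows the programmatic form under assumed names; the workload and output path are placeholders, not the recipe's contents.

```python
# Minimal programmatic memray capture; the just recipe most likely drives
# the memray CLI instead of this API.
from memray import Tracker


def workload() -> list[bytes]:
    # Stand-in allocation-heavy work in place of the real vectorizer worker.
    return [bytes(1024) for _ in range(100_000)]


if __name__ == "__main__":
    with Tracker("projects/pgai/benchmark/results/mem.bin"):  # assumed output path
        workload()
    # Then render it: memray flamegraph projects/pgai/benchmark/results/mem.bin
```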

Demo recordings:

https://www.loom.com/share/5538c8fa4bb243d88a246ca6b82e7cc7?sid=89bf1ed6-8def-4e6e-8fb2-711f42660b7f

https://www.loom.com/share/733fca2f5008460f9d763a72544db5fb?sid=e56f1a72-69f3-41c8-9b3f-36f859f556fc

alejandrodnm marked this pull request as ready for review March 18, 2025 16:28
alejandrodnm requested a review from a team as a code owner March 18, 2025 16:28
```diff
@@ -1 +1 @@
-3.10
+3.12.9
```
Member


Why did this change?

Contributor Author


I'm reverting it back; I had updated it for a test.

We are using 3.12 for the Docker container and I wanted the local version to match.

alejandrodnm (Contributor, Author) commented:

@JamesGuthrie I moved the VCR part to the benchmark repo, and updated the recipes and `create_vectorizer.sql` to be more customizable.
