Persist sandbox information locally in orchestrator #376

Draft: wants to merge 16 commits into main

Conversation

tychoish
Contributor

@tychoish tychoish commented Mar 5, 2025

This sets up a SQLite database in the orchestrator to track the
information about sandboxes in the orchestrator rather than the API
server, in support of rolling deployments.

API server(s) can't be responsible for this information anymore
because they might restart (because of deploys), there might be more
than one of them at a time (for redundancy, high availability, or as a
side effect of deployment), and we don't want to add a load-bearing
system of record for information about the sandboxes that the API
servers can access. Letting orchestrators be the system of record
makes sense because they already have the information, and the
lifecycle of a sandbox is (at the moment) tied to the lifecycle of the
orchestrator.

There are many implementation possibilities, but I/we went with SQLite
because:

  • it's well understood and battle-tested. It's also fast for this
    kind of workload, and we can avoid writing filters/queries in Go and
    just use SQL.

  • it can be used with the DB tooling we already have. (n.b. this is the
    first time I've really used this ORM tool, and it's pretty nice.)

  • writing data to disk means that we can be less worried about a large
    number of short running sandboxes filling up memory, and we can be
    less aggressive about removing data because there's (likely) plenty
    of disk space. We can rely on SQLite's caching mechanism (rather
    than Go or our own implementation) to keep or release data from
    memory.

  • because (at the moment) orchestrators never restart or are
    redeployed, we don't have to worry about schema or data migration:
    realistically every time the orchestrator starts, the database will
    be empty. In the future when we might be able to add

  • if the API servers' view of what's running on the orchestrator is no
    longer strictly consistent (because there might be many of them,
    they might restart, etc.) then we need to keep a record of not just
    what is running but what has run recently so we can make sure to
    bill correctly and so we can distinguish between "this sandbox
    doesn't exist anywhere" and "this sandbox used to exist."

  • embedded in this implementation are version numbers for both
    sandboxes (as they change) and a global version number for all the
    data in the database/orchestrator. The idea here is that if we
    increment these numbers correctly when modifying the data in SQLite,
    we can provide an interface that the API servers can use to
    efficiently determine if their cache is out of date.

    • because we store the global version number in the sandbox record
      you can get all of the sandboxes that have been modified or
      created since your last view.

    • you can compare integers per-record or per-orchestrator to figure
      out if your data is stale rather than needing a more complex
      algorithm.

  • I did attempt to implement this using an in-memory cache rather than
    SQLite, which I think would be possible. However, our concurrent map
    is sharded (to prevent lock contention for modification-heavy
    workloads), and maintaining the version numbers (plus the extra level
    of shard-versioning) makes things much more complicated from a code
    perspective. It's my assessment that SQLite will scale better,
    require less code, and be easier to develop against today,
    in addition to being something we'll want in the future.
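
To make the versioning bullets above concrete, here is a minimal sketch of the idea in plain Go. It is not the code in this PR: an in-memory map guarded by a mutex stands in for the SQLite table and its transactions, and every name in it (Store, Record, Upsert, ChangedSince, Lookup) is hypothetical. It assumes that each write bumps the store-wide counter and stamps the new value onto the record, so a client only needs to remember one integer between polls.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Status separates live sandboxes from ones that ran recently.
type Status int

const (
	Running Status = iota
	Terminated
)

// Record is a hypothetical row in the sandbox table.
type Record struct {
	ID            string
	Status        Status
	Version       int64 // bumped on every change to this sandbox
	GlobalVersion int64 // store-wide counter value at the last change
}

// Store stands in for the SQLite database; the mutex plays the role
// of a transaction so both counters always move together.
type Store struct {
	mu      sync.Mutex
	global  int64
	records map[string]*Record
}

func NewStore() *Store { return &Store{records: map[string]*Record{}} }

// Upsert creates or modifies a record and increments both versions.
func (s *Store) Upsert(id string, st Status) Record {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.global++
	r, ok := s.records[id]
	if !ok {
		r = &Record{ID: id}
		s.records[id] = r
	}
	r.Status = st
	r.Version++
	r.GlobalVersion = s.global
	return *r
}

// ChangedSince returns every record created or modified after the
// caller's last seen global version, plus the current global version.
// In SQL this would be roughly: WHERE global_version > ?
func (s *Store) ChangedSince(since int64) ([]Record, int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []Record
	for _, r := range s.records {
		if r.GlobalVersion > since {
			out = append(out, *r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].GlobalVersion < out[j].GlobalVersion })
	return out, s.global
}

// Lookup reports ok=false for "never existed here"; a Terminated
// record with ok=true means "used to exist".
func (s *Store) Lookup(id string) (Record, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.records[id]
	if !ok {
		return Record{}, false
	}
	return *r, true
}

func main() {
	s := NewStore()
	s.Upsert("sbx-a", Running)
	s.Upsert("sbx-b", Running)
	_, seen := s.ChangedSince(0) // an API server's first full sync

	s.Upsert("sbx-a", Terminated)
	changed, latest := s.ChangedSince(seen)
	fmt.Println(len(changed), changed[0].ID, changed[0].Version, latest)
	// prints: 1 sbx-a 2 3
}
```

In this sketch an API server whose remembered value equals the store's current global version knows its cache is current without fetching any rows, and terminated sandboxes stay in the store so they can still be found for billing.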

This isn't quite done. Remaining work includes:

  • We/I need to rehome the APIs to use data from the database rather
    than from the cache.

  • We/I probably need to cache a version of the sandbox structure (with
    more information) in the database (possibly as binary protobuf?) to
    support the APIs.

  • Testing, of course. At this point the PR doesn't change the behavior
    because the old data storage/cache is still the system of record.

  • I would like more feedback on this implementation or the use of the
    ORM system.

Member

@jakubno jakubno left a comment

I am really sorry I didn't mention it, but we are moving away from entgo. Could you please use sqlc?

Here is the setup I currently have.

version: "2"
sql:
  - engine: "postgresql"
    queries: "db/queries"
    schema: "db/migrations"
    gen:
      go:
        emit_pointers_for_null_types: true
        package: "database"
        out: "packages/shared/pkg/database"
        sql_package: "pgx/v5"
        overrides:
          - db_type: "uuid"
            go_type:
              import: "github.com/google/uuid"
              type: "UUID"
          - db_type: "uuid"
            nullable: true
            go_type:
              import: "github.com/google/uuid"
              type: "UUID"
              pointer: true

          - db_type: "pg_catalog.numeric"
            go_type: "github.com/shopspring/decimal.Decimal"
          - db_type: "pg_catalog.numeric"
            nullable: true
            go_type: "*github.com/shopspring/decimal.Decimal"

          - db_type: "timestamptz"
            go_type: "time.Time"
          - db_type: "timestamptz"
            go_type:
              import: "time"
              type: "Time"
              pointer: true
            nullable: true

…rrently-cached-by-the-api-server-about-e2b-1394
@jakubno jakubno added the improvement Improvement for current functionality label Mar 6, 2025
tychoish added 4 commits March 7, 2025 08:34
(cherry picked from commit 2bc8680)
(cherry picked from commit e0bdcf8)
…rrently-cached-by-the-api-server-about-e2b-1394
Contributor

@dobrac dobrac left a comment

Looks good. One thing though: if this already includes a way to query the current state/data, you should be able to write (at least) integration tests (which are now merged) to verify the functionality/behavior.

(here is a PR adding the orchestrator client to the integration tests: #403)

Member

What is the blocker for merging this?

@tychoish tychoish changed the title feat: persist sandbox information locally in orchestrator Persist sandbox information locally in orchestrator Mar 27, 2025
Member

@jakubno jakubno left a comment

waiting for tests

@jakubno jakubno marked this pull request as draft May 6, 2025 11:31
Labels
improvement Improvement for current functionality

4 participants