I'm testing some FLAME runners to do KMeans clustering. I'm trying to figure out how many embeddings I can process per Fly.io machine size, so I'm testing a range of values until failure.
When I use a value that is too high, the process crashes, the machine cleans up, and subsequent invocations of the runner fail with:
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

    (stdlib 5.2.3) :ets.lookup_element(:kmeans_md, :meta, 2)
    (flame 0.5.1) lib/flame/pool.ex:381: FLAME.Pool.lookup_meta/1
    (flame 0.5.1) lib/flame/pool.ex:315: FLAME.Pool.caller_checkout!/5
    iex:1: (file)
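This ArgumentError is what :ets.lookup_element/3 raises when the named table is missing, so the trace suggests the :kmeans_md table no longer exists when the next caller checks out. Why it is gone is not confirmed here. A minimal sketch for checking this from iex follows; EtsCheck is a made-up helper module, not part of FLAME:

```elixir
defmodule EtsCheck do
  # Hypothetical helper, not part of FLAME.
  # :ets.whereis/1 returns :undefined when no named table is registered under `name`.
  def table_exists?(name) when is_atom(name) do
    :ets.whereis(name) != :undefined
  end

  # Same lookup as in the stack trace, but returns a tagged tuple instead of raising.
  # ArgumentError is also raised when the :meta key is absent, so the error stays generic.
  def safe_meta(name) when is_atom(name) do
    {:ok, :ets.lookup_element(name, :meta, 2)}
  rescue
    ArgumentError -> {:error, :badarg}
  end
end
```

Running EtsCheck.table_exists?(:kmeans_md) on the parent node right after the crash would show whether the table is actually gone or the lookup is failing for another reason.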
I'm using these with single_use: true, so I'm not super concerned about the OOM error. If this happened in production though, it'd be a pretty big issue that all subsequent invocations fail.
Not sure how to resolve the issue either - sometimes a restart fixes it, sometimes waiting 10+ minutes. Not a blocker for me (the point of the stress testing is to avoid this in prod), but seemed report-worthy!
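As a stopgap on the caller side (not a fix for the underlying problem), the call could be wrapped in a small retry. This is only a sketch: CallRetry is a hypothetical helper, the attempt count and delay are arbitrary, and it will not help if the table never comes back. Rescuing ArgumentError this broadly can also hide unrelated bugs.

```elixir
defmodule CallRetry do
  # Hypothetical helper, not part of FLAME.
  # Runs `fun`; on ArgumentError it waits `delay_ms` and tries again,
  # making at most `attempts` attempts in total.
  def with_retry(fun, attempts \\ 3, delay_ms \\ 1_000)
      when is_function(fun, 0) and attempts >= 1 do
    try do
      {:ok, fun.()}
    rescue
      e in ArgumentError ->
        if attempts > 1 do
          Process.sleep(delay_ms)
          with_retry(fun, attempts - 1, delay_ms)
        else
          # Out of attempts: hand the last exception back to the caller.
          {:error, e}
        end
    end
  end
end
```

Usage would look something like CallRetry.with_retry(fn -> FLAME.call(:kmeans_md, fn -> :work end) end), where :work stands in for the real clustering call.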
---
PS: FLAME configuration:
defmodule KMeans.Supervisor do
  use Supervisor
  require Logger
  alias Machine.Size

  def start_link(_) do
    Supervisor.start_link(__MODULE__, nil, name: __MODULE__)
  end

  @impl true
  def init(_) do
    config = Application.get_env(:flame, :kmeans)

    # One FLAME.Pool per machine size, each registered under size.name
    children =
      [Size.sm(), Size.md(), Size.lg()]
      |> Enum.map(fn size ->
        # Use the Fly backend when configured, otherwise fall back to the local backend
        backend =
          case config do
            {FLAME.FlyBackend, defaults} ->
              {FLAME.FlyBackend,
               Keyword.merge(defaults,
                 memory_mb: size.memory * 1024,
                 cpus: size.cpu
               )}

            _ ->
              {FLAME.LocalBackend, []}
          end

        {
          FLAME.Pool,
          backend: backend,
          name: size.name,
          min: 0,
          max: 10,
          max_concurrency: 1,
          single_use: true,
          log: :info,
          boot_timeout: :timer.minutes(5),
          timeout: :timer.minutes(10),
          on_grow_start: &IO.inspect(&1, label: "[FLAME | growing | #{size.name}]"),
          on_grow_end: &IO.inspect(Map.put(&2, :status, &1), label: "[FLAME | success | #{size.name}]"),
          on_shrink: &IO.inspect(&1, label: "[FLAME | closing | #{size.name}]")
        }
      end)

    Supervisor.init(children, strategy: :one_for_one)
  end
end