
feat(storage): backend for SQLite on job storage#125

Merged
LeoBorai merged 17 commits into main from 110-job-persistance
Mar 1, 2026

Conversation

@LeoBorai
Owner

@LeoBorai LeoBorai commented Feb 24, 2026

Provides support for storing jobs in SQLite so they persist after the mate server shuts down or crashes.

@LeoBorai LeoBorai requested a review from Copilot March 1, 2026 16:23
@LeoBorai LeoBorai marked this pull request as ready for review March 1, 2026 16:23

Copilot AI left a comment


Pull request overview

Adds a SQLite-backed storage layer to persist jobs across process restarts/crashes, replacing the prior in-memory job map in the storage process.

Changes:

  • Introduces a pluggable Backend trait with a SqliteBackend implementation plus SQL migrations.
  • Updates the storage process initialization to accept a home directory and create/use storage.sqlite.
  • Adds SQLx offline workflow support (SQLX_OFFLINE, .sqlx metadata, sqlx-prepare) and documents local dev prerequisites.

Reviewed changes

Copilot reviewed 13 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/storage/src/lib.rs Switches storage from in-memory state to a backend abstraction and wires IPC handlers to backend calls.
src/storage/src/backend.rs Adds the Backend trait and exports the sqlite backend module.
src/storage/src/backend/sqlite.rs Implements SQLite persistence using SQLx (migrations, CRUD-ish operations, claiming).
src/storage/src/backend/sqlite/migrations/001_create_table_jobs.sql Creates the jobs table and supporting indexes.
src/storage/Cargo.toml Adds dependencies needed for SQLx/SQLite backend (sqlx, serde_json, async-trait, etc.).
src/mate/src/proto/job.rs Centralizes job status/result strings into constants and uses them in Display/FromStr.
src/cli/src/process/storage.rs Makes StorageProcess::new async and passes the home directory through.
src/cli/src/cli/cmd/component/storage.rs Computes the storage home path and passes it into the storage process.
README.md Documents dev prerequisites (Rust toolchain, SQLx CLI).
Justfile Adds sqlx-prepare helper and a dkc helper task.
Cargo.toml Adds sqlx to workspace dependencies.
Cargo.lock Locks new transitive dependencies from SQLx and sqlite support.
.sqlx/query-ad36629d0f34f5d236818a2ad2a236cf5dc9c1c33cec41a5325cd954000f3e1a.json Adds SQLx offline metadata for the insert query.
.gitignore Ignores .env.
.env.example Provides a sample DATABASE_URL for SQLx tooling.
.cargo/config.toml Enables SQLx offline mode for builds.
Files not reviewed (1)
  • .sqlx/query-ad36629d0f34f5d236818a2ad2a236cf5dc9c1c33cec41a5325cd954000f3e1a.json: Language not supported
Comments suppressed due to low confidence (7)

src/storage/src/lib.rs:94

  • ClaimJobs receives a (start, end) window from the scheduler, but storage discards start and uses SystemTime::now() instead. This can lead to inconsistent behavior if the scheduler and storage clocks differ or if the scheduler ever sends a non-"now" start. Consider passing the provided start through to backend.claim_jobs(...).
            MessagePayload::ClaimJobs((_, end)) => {
                match self
                    .backend
                    .claim_jobs(MAX_JOBS_PER_BATCH, SystemTime::now(), end)
                    .await

src/storage/src/backend/sqlite.rs:174

  • update_job_completed ignores the case where the job id doesn't exist: execute(...) returning 0 affected rows still results in Ok(()). Consider checking rows_affected() and returning an error when no row was updated so callers get a reliable signal (this also matches the previous behavior which surfaced "job not found").
        sqlx::query("UPDATE jobs SET status = ?, result = ?, completed_at = ? WHERE id = ?")
            .bind(status)
            .bind(result_json)
            .bind(completed_at)
            .bind(id)
            .execute(&self.pool)
            .await?;

        Ok(())
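A minimal sketch of the suggested check, factored into a std-only helper (the function name and error type are illustrative, standing in for the crate's real error handling):

```rust
// Hypothetical helper: turn a zero-row UPDATE into an error so callers
// get the same "job not found" signal the previous in-memory backend gave.
fn ensure_updated(rows_affected: u64, id: &str) -> Result<(), String> {
    if rows_affected == 0 {
        Err(format!("job not found: {id}"))
    } else {
        Ok(())
    }
}
```

In `update_job_completed` this would be fed the value of `.execute(&self.pool).await?.rows_affected()` instead of discarding the query result.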

src/storage/src/backend/sqlite.rs:124

  • create_job only inserts a subset of Job fields, so persisted rows rely on DB defaults for errors, attempts, and max_attempts. This drops any non-default max_attempts set on the Job before storing (and would also ignore existing errors/attempts if those ever need to be persisted). Consider inserting these fields explicitly (and started_at/completed_at/result when present) to ensure round-tripping is lossless.
    async fn create_job(&self, job: Job) -> Result<Job> {
        let id = job.id.to_string();
        let args = serde_json::to_string(&job.args)?;
        let task = job.task.to_string();
        let status = job.status.to_string();
        let scheduled_at = into_unix_timestamp(job.scheduled_at)?;
        let record = sqlx::query_as!(
            JobRecord,
            r#"
            INSERT INTO jobs (
                id,
                name,
                args,
                status,
                scheduled_at,
                task,
                started_at,
                completed_at
            ) VALUES (
                $1,
                $2,
                $3,
                $4,
                $5,
                $6,
                $7,
                $8
            ) RETURNING *"#,
            id,
            job.name,
            args,
            status,
            scheduled_at,
            task,
            Option::<i64>::None,
            Option::<i64>::None,
        )

src/storage/src/lib.rs:29

  • Storage::new builds the sqlite path via home.to_str().unwrap(), which can panic on non-UTF8 paths, and it doesn't ensure the parent directory exists (e.g. ~/.mate). This can make first-run startup fail. Consider creating the directory with std::fs::create_dir_all(&home_parent) and passing a &Path/PathBuf into SqliteBackend::new instead of unwrapping a UTF-8 string.
    pub async fn new(transport: Box<dyn Transport>, home: PathBuf) -> Result<Self> {
        let ipc = Arc::new(IpcServer::new(IPC_SENDER_STORAGE, transport));
        let home = home.join("storage.sqlite");
        let backend = Arc::new(SqliteBackend::new(home.to_str().unwrap()).await?);
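A std-only sketch of that suggestion (the helper name is illustrative); the returned `PathBuf` can be handed to `SqliteConnectOptions::filename`, which accepts any `AsRef<Path>`, so no `to_str().unwrap()` is needed:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Build the database path without a panicking to_str().unwrap(), and
// make sure the parent directory (e.g. ~/.mate) exists before SQLite
// tries to create the file inside it.
fn storage_db_path(home: &Path) -> io::Result<PathBuf> {
    fs::create_dir_all(home)?;
    Ok(home.join("storage.sqlite"))
}
```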

src/storage/src/lib.rs:98

  • On backend errors, QueryJobs and ClaimJobs currently return JobsResult(vec![]), which makes storage/database failures indistinguishable from "no jobs" and can cause the scheduler/API to silently stop seeing jobs. At minimum, log the error; ideally propagate an error payload (or add a dedicated error response) so callers can surface/alert on storage failures.
            MessagePayload::QueryJobs(query) => match self.backend.retrieve_jobs(query).await {
                Ok(jobs) => Some(MessagePayload::JobsResult(jobs)),
                Err(_err) => Some(MessagePayload::JobsResult(vec![])),
            },
            MessagePayload::ClaimJobs((_, end)) => {
                match self
                    .backend
                    .claim_jobs(MAX_JOBS_PER_BATCH, SystemTime::now(), end)
                    .await
                {
                    Ok(jobs) => Some(MessagePayload::JobsResult(jobs)),
                    Err(_err) => Some(MessagePayload::JobsResult(vec![])),
                }
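The minimal version of that suggestion (log, then degrade to an empty result) can be sketched with placeholder types; a proper fix would add an error-carrying response payload:

```rust
// Hypothetical shape of the handler's error path: log the backend
// failure instead of silently mapping it to "no jobs". String stands
// in for the real Job struct here.
fn jobs_or_log<E: std::fmt::Display>(res: Result<Vec<String>, E>) -> Vec<String> {
    match res {
        Ok(jobs) => jobs,
        Err(err) => {
            eprintln!("storage backend error: {err}");
            Vec::new()
        }
    }
}
```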

src/cli/src/cli/cmd/component/storage.rs:25

  • home_dir().unwrap() can panic (and std::env::home_dir is deprecated). Since this path is used to place the sqlite DB, consider using a non-deprecated home resolution (or reading from config) and returning a clear error when the home directory can't be determined. Also ensure the ~/.mate directory is created before starting storage so sqlite file creation doesn't fail due to a missing parent directory.
        let config = Config::from_file(&self.config)?;
        let mut home = home_dir().unwrap();
        home.push(".mate");
        let transport = make_transport(config.clone(), ProcessType::Storage).await?;
        let mut storage = StorageProcess::new(transport, home).await?;
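One std-only way to follow that suggestion (`mate_home` is a hypothetical name; real code might prefer the `dirs` crate, and Windows would need `USERPROFILE` rather than `HOME`):

```rust
use std::env;
use std::path::PathBuf;

// Resolve ~/.mate without the deprecated std::env::home_dir() and
// without unwrap(): a missing HOME becomes a clear error, not a panic.
fn mate_home() -> Result<PathBuf, String> {
    let home = env::var_os("HOME")
        .ok_or_else(|| "could not determine home directory".to_string())?;
    Ok(PathBuf::from(home).join(".mate"))
}
```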

src/storage/src/backend/sqlite.rs:220

  • into_system_time casts timestamp: i64 to u64 before building a Duration. If a negative timestamp ever gets into the DB, this will wrap to a huge u64 and can produce a far-future SystemTime instead of erroring. Consider validating timestamp >= 0 before casting, and returning an error for negative values.
fn into_system_time(timestamp: i64) -> Result<SystemTime> {
    SystemTime::UNIX_EPOCH
        .checked_add(Duration::from_secs(timestamp as u64))
        .context("Invalid timestamp")
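A self-contained sketch of the validated conversion (using a `String` error in place of the crate's `anyhow` context):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Reject negative timestamps instead of letting `as u64` wrap them
// into a far-future SystemTime.
fn into_system_time(timestamp: i64) -> Result<SystemTime, String> {
    let secs = u64::try_from(timestamp)
        .map_err(|_| format!("invalid negative timestamp: {timestamp}"))?;
    UNIX_EPOCH
        .checked_add(Duration::from_secs(secs))
        .ok_or_else(|| format!("timestamp out of range: {timestamp}"))
}
```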



Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 16 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • .sqlx/query-ad36629d0f34f5d236818a2ad2a236cf5dc9c1c33cec41a5325cd954000f3e1a.json: Language not supported
Comments suppressed due to low confidence (5)

src/storage/src/lib.rs:29

  • Storage::new builds a filesystem path and then calls home.to_str().unwrap(). This can panic on non-UTF8 paths, and it also doesn’t ensure the parent directory exists (SQLite will fail to create the DB file if ~/.mate/ hasn’t been created). Prefer passing a Path/PathBuf through to SqliteConnectOptions::filename and create_dir_all(home.parent()) before opening the DB.
    pub async fn new(transport: Box<dyn Transport>, home: PathBuf) -> Result<Self> {
        let ipc = Arc::new(IpcServer::new(IPC_SENDER_STORAGE, transport));
        let home = home.join("storage.sqlite");
        let backend = Arc::new(SqliteBackend::new(home.to_str().unwrap()).await?);

src/storage/src/lib.rs:94

  • ClaimJobs discards the start timestamp provided by the scheduler (ClaimJobs((_, end))) and substitutes SystemTime::now(). This can shrink/shift the intended claim window and potentially miss jobs near the boundary. Use the (start, end) values from the message when calling backend.claim_jobs.
            MessagePayload::ClaimJobs((_, end)) => {
                match self
                    .backend
                    .claim_jobs(MAX_JOBS_PER_BATCH, SystemTime::now(), end)
                    .await

src/storage/src/lib.rs:103

  • handle_message no longer handles several MessagePayload variants that other components send to Storage (e.g. UpdateJobStatus from the scheduler, and JobStarted/JobFailed from executors). As a result, job state transitions won’t be persisted/updated. Add match arms for these variants and implement the corresponding backend updates.
            MessagePayload::Ping => Some(MessagePayload::Pong),
            MessagePayload::Shutdown => Some(MessagePayload::ShutdownAck),
            _ => None,
        }

src/storage/src/lib.rs:89

  • On backend errors, QueryJobs returns an empty list. This will make API callers think there are simply no jobs instead of surfacing a storage failure. At minimum, log the error; ideally introduce an error-carrying response variant so callers can distinguish failure from an empty result set.
            MessagePayload::QueryJobs(query) => match self.backend.retrieve_jobs(query).await {
                Ok(jobs) => Some(MessagePayload::JobsResult(jobs)),
                Err(_err) => Some(MessagePayload::JobsResult(vec![])),
            },

src/storage/src/lib.rs:98

  • In the ClaimJobs handler, backend errors are swallowed by returning an empty JobsResult. This can silently stall scheduling while hiding the underlying storage issue. Please at least log the error (and ideally propagate it in a way the scheduler can detect).
                    Ok(jobs) => Some(MessagePayload::JobsResult(jobs)),
                    Err(_err) => Some(MessagePayload::JobsResult(vec![])),
                }


@LeoBorai LeoBorai requested a review from Copilot March 1, 2026 17:15
@LeoBorai LeoBorai changed the title from "chore: sqlite for local storage" to "feat(storage): backend for SQLite on job storage" Mar 1, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 16 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • .sqlx/query-111a167ea1da690f9b2fdb8112d0fecf80a6ff4d202adc5772bf195ea7ad3a05.json: Language not supported
Comments suppressed due to low confidence (5)

src/cli/src/cli/cmd/component/storage.rs:25

  • home is set to ~/.mate, but the directory is not created before SQLite is opened. SqliteConnectOptions::create_if_missing(true) creates the DB file, but won’t create missing parent directories, so storage can fail to start on a fresh machine. Consider creating the directory (e.g., create_dir_all) before launching StorageProcess.
        let config = Config::from_file(&self.config)?;
        let mut home = home_dir().context("Failed to get home directory")?;
        home.push(".mate");
        let transport = make_transport(config.clone(), ProcessType::Storage).await?;
        let mut storage = StorageProcess::new(transport, home).await?;

src/storage/src/lib.rs:29

  • home.to_str().unwrap() can panic for non-UTF8 paths, and SqliteConnectOptions::filename(...) accepts a Path-like value. Prefer passing the PathBuf/&Path directly (or returning a proper error) instead of unwrapping.
        let home = home.join("storage.sqlite");
        let backend = Arc::new(SqliteBackend::new(home.to_str().unwrap()).await?);

src/storage/src/lib.rs:89

  • On backend errors, QueryJobs currently returns an empty list, which makes API callers interpret “storage failed” as “no jobs exist”. Consider logging the error and propagating it to callers (e.g., change JobsResult to carry Result<Vec<Job>, String> or add a dedicated error payload).
            MessagePayload::QueryJobs(query) => match self.backend.retrieve_jobs(query).await {
                Ok(jobs) => Some(MessagePayload::JobsResult(jobs)),
                Err(_err) => Some(MessagePayload::JobsResult(vec![])),
            },

src/storage/src/lib.rs:98

  • Similarly, returning JobsResult(vec![]) on claim_jobs failure hides the fact that the scheduler can’t load work and may leave the system idle without any visible error. At minimum, log the error; ideally propagate an error to the scheduler so it can surface/handle it explicitly.
            MessagePayload::ClaimJobs((_, end)) => {
                match self
                    .backend
                    .claim_jobs(MAX_JOBS_PER_BATCH, SystemTime::now(), end)
                    .await
                {
                    Ok(jobs) => Some(MessagePayload::JobsResult(jobs)),
                    Err(_err) => Some(MessagePayload::JobsResult(vec![])),
                }

src/storage/src/backend/sqlite.rs:203

  • update_job_completed doesn’t verify that a row was actually updated (e.g., when the job ID doesn’t exist). This can silently acknowledge success while losing the completion update. Consider checking rows_affected() and returning an error when it’s 0.
            JobResult::Success(_) => {
                sqlx::query(
                    r#"UPDATE jobs
                        SET
                            status = 'completed',
                            result = ?,
                            completed_at = ?,
                            attempts = attempts + 1
                        WHERE id = ?"#,
                )
                .bind(result_json)
                .bind(completed_at)
                .bind(id)
                .execute(&self.pool)
                .await?;
            }
            JobResult::Failure(error) => {
                sqlx::query(
                    r#"UPDATE jobs
                        SET
                            status = 'failed',
                            result = ?,
                            completed_at = ?,
                            attempts = attempts + 1,
                            errors = json_insert(errors, '$[#]', ?)
                        WHERE id = ?"#,
                )
                .bind(result_json)
                .bind(completed_at)
                .bind(error)
                .bind(id)
                .execute(&self.pool)
                .await?;
            }
        }

        Ok(())


LeoBorai and others added 2 commits March 1, 2026 18:27
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@LeoBorai LeoBorai merged commit ffab08c into main Mar 1, 2026
8 checks passed
@LeoBorai LeoBorai linked an issue Mar 2, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Write Jobs to Disk by Default

2 participants