
RFC: Inference API - Running onnx models with low-level abstraction #479

Open
@kallebysantos

Labels: A-ai-api, C-enhancement, E-easy, javascript


RFC: Inference API

Coming from PR #436, the Inference API is a user-friendly interface that allows developers to easily run their own models using the power of the low-level ONNX Rust backend.

It's based on two core components: RawSession and RawTensor.

  • RawSession: A low-level Supabase.ai.Session that can execute any .onnx model. It's recommended for use cases that need more control over the pre/post-processing steps, as well as for executing linear regression, tabular classification, and self-made models.

For common tasks like NLP, audio, or computer vision, huggingface/transformers.js is recommended, since it already handles all the pre/post-processing.

  • RawTensor: A low-level data representation of the model input/output. Inference API tensors are fully compatible with Transformers.js tensors, which means developers can keep using the high-level abstractions that transformers.js provides, like .sum(), .normalize(), .min() (see the sketch below).
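
A minimal interop sketch, assuming the @huggingface/transformers package name and its Tensor constructor; a RawTensor's fields (type, data, dims) map directly onto a Transformers.js Tensor:

import { Tensor } from "@huggingface/transformers";
const { RawTensor } = Supabase.ai;

const raw = new RawTensor("float32", [3.0, 4.0], [1, 2]);

// Wrap the raw tensor's fields in a Transformers.js tensor to use its high-level API
const hf = new Tensor(raw.type, raw.data, raw.dims);

console.log(hf.normalize().tolist()); // [[0.6, 0.8]]
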
Examples:

Simple utilization:

Loading a RawSession:

const session = await RawSession.fromHuggingFace('Supabase/gte-small');
// or using the model file URL directly
const session = await RawSession.fromUrl("https://example.com/model.onnx");

Executing a RawSession with RawTensor:

const session = await RawSession.fromUrl("https://example.com/model.onnx");

// Prepare the input tensors
const inputs = {
  input1: new RawTensor("float32", [1.0, 2.0, 3.0], [1, 3]),
  input2: new RawTensor("float32", [4.0, 5.0, 6.0], [1, 3]),
};

const outputs = await session.run(inputs);
console.log(outputs.output1); // Output tensor
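
Since input and output names vary per model, a session's inputs and outputs properties (see the type definitions below) can be inspected to discover what the model expects:

const session = await RawSession.fromUrl("https://example.com/model.onnx");

console.log(session.inputs);  // e.g. ["input1", "input2"]
console.log(session.outputs); // e.g. ["output1"]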

Generating embeddings from scratch:

This example demonstrates how the Inference API can be used in complex scenarios while taking advantage of Transformers.js high-level functions.

import { Tensor } from "@huggingface/transformers";
const { RawTensor, RawSession } = Supabase.ai;
   
const session = await RawSession.fromHuggingFace('Supabase/gte-small');
   
// Example only; in real 'feature-extraction' the tensors come from the tokenizer step.
// Consider 'n' as the batch size
const inputs = {
   input_ids: new RawTensor('float32', [1, 2, 3...], [n, 2]),
   attention_mask: new RawTensor('float32', [...], [n, 2]),
   // @ts-ignore: mixing Tensors from both libraries
   token_type_ids: new Tensor('float32', [...], [n, 2])
};
   
const { last_hidden_state } = await session.run(inputs);
   
// Using `transformers.js` APIs
const hfTensor = Tensor.mean_pooling(last_hidden_state, inputs.attention_mask).normalize();
   
return hfTensor.tolist();
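
For reference, a minimal sketch of that tokenizer step, assuming the AutoTokenizer API from @huggingface/transformers; the tokenizer output already contains the input_ids, attention_mask and token_type_ids tensors the model expects:

import { AutoTokenizer } from "@huggingface/transformers";
const { RawSession } = Supabase.ai;

const session = await RawSession.fromHuggingFace('Supabase/gte-small');
const tokenizer = await AutoTokenizer.from_pretrained('Supabase/gte-small');

// Tokenize a batch of sentences; padding/truncation keeps the dims consistent
const inputs = await tokenizer(['Hello world'], { padding: true, truncation: true });

const { last_hidden_state } = await session.run(inputs);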

Self-made models

This example illustrates how users can train their own model and execute it directly from edge-runtime.

Here you can check a deployable example of it, using the current Supabase stack.

The model was trained to expect the following object payload:

[
  {
    "Model_Year": 2021,
    "Engine_Size": 2.9,
    "Cylinders": 6,
    "Fuel_Consumption_in_City": 13.9,
    "Fuel_Consumption_in_City_Hwy": 10.3,
    "Fuel_Consumption_comb": 12.3,
    "Smog_Level": 3,
  },
  {
    "Model_Year": 2023,
    "Engine_Size": 2.4,
    "Cylinders": 4,
    "Fuel_Consumption_in_City": 9.9,
    "Fuel_Consumption_in_City_Hwy": 7.0,
    "Fuel_Consumption_comb": 8.6,
    "Smog_Level": 3,
  }
]

Then the model inference can be done inside a common Edge Function:

const { RawTensor, RawSession } = Supabase.ai;

// Custom filename on Hugging Face, default: 'model_quantized.onnx'
const session = await RawSession.fromHuggingFace('kallebysantos/vehicle-emission', {
  path: {
    modelFile: 'model.onnx',
  },
});

Deno.serve(async (req: Request) => {
  const carsBatchInput = await req.json();

  // Parsing objects to tensor input
  const inputTensors = {};
  session.inputs.forEach((inputKey) => {
    const values = carsBatchInput.map((item) => item[inputKey]);

    // This model uses `float32` tensors, but other models may use mixed types
    inputTensors[inputKey] = new RawTensor('float32', values, [values.length, 1]);
  });

  const { emissions } = await session.run(inputTensors);

  return Response.json({ result: emissions });  // [ 289.01, 199.53]
});
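
For completeness, a hypothetical client call for this function (the project URL and function name are placeholders, and carsBatch is the payload array shown above):

// Hypothetical client-side call; the project URL is a placeholder
const res = await fetch("https://PROJECT_REF.supabase.co/functions/v1/vehicle-emission", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(carsBatch), // the payload array shown above
});

console.log(await res.json()); // { result: [289.01, 199.53] }
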
Type definitions

These TypeScript definitions should be added to supabase/functions-js:

declare namespace Supabase {
  /**
   * Provides AI related APIs
   */
  export interface Ai {
    /** Provides a user-friendly interface for the low-level *onnx backend API*.
     * A `RawSession` can execute any *onnx* model, but we only recommend it for `tabular` or *self-made* models, where you need more control over model execution and pre/post-processing.
     * Consider a high-level implementation like `@huggingface/transformers.js` for generic tasks like `nlp`, `computer-vision` or `audio`.
     *
     * **Example:**
     * ```typescript
     * const session = await RawSession.fromHuggingFace('Supabase/gte-small');
     * // const session = await RawSession.fromUrl("https://example.com/model.onnx");
     *
     * // Prepare the input tensors
     * const inputs = {
     *   input1: new RawTensor("float32", [1.0, 2.0, 3.0], [3]),
     *   input2: new RawTensor("float32", [4.0, 5.0, 6.0], [3]),
     * };
     *
     * // Run the model
     * const outputs = await session.run(inputs);
     *
     * console.log(outputs.output1); // Output tensor
     * ```
     */
    readonly RawSession: typeof RawSession;

    /** A low-level representation of model input/output.
     * Supabase's `RawTensor` is fully compatible with `@huggingface/transformers.js`'s `Tensor`. It means that you can use its high-level API to apply common operations like `sum()`, `min()`, `max()`, `normalize()` etc.
     *
     * **Example: Generating embeddings from scratch**
     * ```typescript
     * import { Tensor } from "@huggingface/transformers";
     * const { RawTensor, RawSession } = Supabase.ai;
     *
     * const session = await RawSession.fromHuggingFace('Supabase/gte-small');
     *
     * // Example only; in real 'feature-extraction' the tensors come from the tokenizer step.
     * const inputs = {
     *    input_ids: new RawTensor('float32', [...], [n, 2]),
     *    attention_mask: new RawTensor('float32', [...], [n, 2]),
     *    token_type_ids: new Tensor('float32', [...], [n, 2]) // Hugging Face tensor
     * };
     *
     * const { last_hidden_state } = await session.run(inputs);
     *
     * // Using `transformers.js` APIs
     * const hfTensor = Tensor.mean_pooling(last_hidden_state, inputs.attention_mask).normalize();
     *
     * return hfTensor.tolist();
     *
     * ```
     */
    readonly RawTensor: typeof RawTensor;
  }

  /**
   * Provides AI related APIs
   */
  export const ai: Ai;

  export type TensorDataTypeMap = {
    float32: Float32Array | number[];
    float64: Float64Array | number[];
    string: string[];
    int8: Int8Array | number[];
    uint8: Uint8Array | number[];
    int16: Int16Array | number[];
    uint16: Uint16Array | number[];
    int32: Int32Array | number[];
    uint32: Uint32Array | number[];
    int64: BigInt64Array | number[];
    uint64: BigUint64Array | number[];
    bool: Uint8Array | number[];
  };

  export type TensorMap = { [key: string]: RawTensor<keyof TensorDataTypeMap> };

  export class RawTensor<T extends keyof TensorDataTypeMap> {
    /**  Type of the tensor. */
    type: T;

    /** The data stored in the tensor. */
    data: TensorDataTypeMap[T];

    /**  Dimensions of the tensor. */
    dims: number[];

    /** The total number of elements in the tensor. */
    size: number;

    constructor(type: T, data: TensorDataTypeMap[T], dims: number[]);
  }

  export class RawSession {
    /** The underlying session's ID.
     * Session IDs are unique for each loaded model, meaning that even if a session is constructed twice it will share the same ID.
     */
    id: string;

    /** A list of all input keys the model expects. */
    inputs: string[];

    /** A list of all output keys the model will return. */
    outputs: string[];

    /** Loads an ONNX model session from a source URL.
     * Sessions are loaded once, then kept warm across worker requests.
     */
    static fromUrl(source: string | URL): Promise<RawSession>;

    /** Loads an ONNX model session from a **Hugging Face** repository.
     * Sessions are loaded once, then kept warm across worker requests.
     */
    static fromHuggingFace(repoId: string, opts?: {
      /**
       * @default 'https://huggingface.co'
       */
      hostname?: string | URL;
      path?: {
        /**
         * @default '{REPO_ID}/resolve/{REVISION}/onnx/{MODEL_FILE}?download=true'
         */
        template?: string;
        /**
         * @default 'main'
         */
        revision?: string;
        /**
         * @default 'model_quantized.onnx'
         */
        modelFile?: string;
      };
    }): Promise<RawSession>;

    /** Run the current session with the given inputs.
     * Use `inputs` and `outputs` properties to know the required inputs and expected results for the model session.
     *
     * @param inputs The input tensors required by the model.
     * @returns The output tensors generated by the model.
     *
     * @example
     * ```typescript
     * const session = await RawSession.fromUrl("https://example.com/model.onnx");
     *
     * // Prepare the input tensors
     * const inputs = {
     *   input1: new RawTensor("float32", [1.0, 2.0, 3.0], [3]),
     *   input2: new RawTensor("float32", [4.0, 5.0, 6.0], [3]),
     * };
     *
     * // Run the model
     * const outputs = await session.run(inputs);
     *
     * console.log(outputs.output1); // Output tensor
     * ```
     */
    run(inputs: TensorMap): Promise<TensorMap>;
  }
}
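
For clarity, here is how the fromHuggingFace defaults above resolve into a download URL (illustrative only):

// Illustrative: RawSession.fromHuggingFace('Supabase/gte-small') with all defaults
//
//   hostname:  'https://huggingface.co'
//   template:  '{REPO_ID}/resolve/{REVISION}/onnx/{MODEL_FILE}?download=true'
//   revision:  'main'
//   modelFile: 'model_quantized.onnx'
//
// resolves to:
// https://huggingface.co/Supabase/gte-small/resolve/main/onnx/model_quantized.onnx?download=true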

Some ideas that could also be implemented:

  • Add Supabase Storage integration (see the sketch below)
  • Possibility to edit request headers for external authentication
  • Fine control of the Session ID
  • Model size constraints: check the size before downloading the model
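
The Storage integration idea could already be approximated today by pairing supabase-js signed URLs with RawSession.fromUrl; a minimal sketch, where the 'models' bucket and file name are hypothetical:

import { createClient } from "jsr:@supabase/supabase-js@2";
const { RawSession } = Supabase.ai;

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Hypothetical 'models' bucket holding a trained .onnx file
const { data, error } = await supabase.storage
  .from("models")
  .createSignedUrl("model.onnx", 60);
if (error) throw error;

const session = await RawSession.fromUrl(data.signedUrl);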
