
RFC: Inference API - Running onnx models with low-level abstraction #479

Open
@kallebysantos

Labels: A-ai-api, C-enhancement, E-easy, javascript


RFC: Inference API

Coming from PR #436, the Inference API is a user-friendly interface that allows developers to easily run their own models using the power of the low-level ONNX Rust backend.

It's based on two core components: RawSession and RawTensor.

  • RawSession: A low-level Supabase.ai.Session that can execute any .onnx model. It's recommended for use cases that need more control over the pre/post-processing steps, as well as for executing linear regression, tabular classification, and self-made models.

For common tasks like NLP, audio, or computer vision, huggingface/transformers.js is recommended, since it already handles all the pre/post-processing.

  • RawTensor: A low-level data representation of the model input/output. Inference API tensors are fully compatible with Transformers.js tensors, which means developers can keep using the high-level abstractions that transformers.js provides, like .sum(), .normalize(), .min() (see the sketch below).
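
A minimal interop sketch, assuming the @huggingface/transformers package name and its Tensor constructor; a RawTensor's fields (type, data, dims) map directly onto a Transformers.js Tensor:

import { Tensor } from "@huggingface/transformers";
const { RawTensor } = Supabase.ai;

const raw = new RawTensor("float32", [3.0, 4.0], [1, 2]);

// Wrap the raw tensor's fields in a Transformers.js tensor to use its high-level API
const hf = new Tensor(raw.type, raw.data, raw.dims);

console.log(hf.normalize().tolist()); // [[0.6, 0.8]]
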
Examples:

Simple utilization:

Loading a RawSession:

const session = await RawSession.fromHuggingFace('Supabase/gte-small');
// or using the model file URL directly
const session = await RawSession.fromUrl("https://example.com/model.onnx");

Executing a RawSession with RawTensor:

const session = await RawSession.fromUrl("https://example.com/model.onnx");

// Prepare the input tensors
const inputs = {
  input1: new RawTensor("float32", [1.0, 2.0, 3.0], [1, 3]),
  input2: new RawTensor("float32", [4.0, 5.0, 6.0], [1, 3]),
};

const outputs = await session.run(inputs);
console.log(outputs.output1); // Output tensor
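
Since input and output names vary per model, a session's inputs and outputs properties (see the type definitions below) can be inspected to discover what the model expects:

const session = await RawSession.fromUrl("https://example.com/model.onnx");

console.log(session.inputs);  // e.g. ["input1", "input2"]
console.log(session.outputs); // e.g. ["output1"]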

Generating embeddings from scratch:

This example demonstrates how the Inference API can be used in complex scenarios while taking advantage of Transformers.js high-level functions.

import { Tensor } from "@huggingface/transformers";
const { RawTensor, RawSession } = Supabase.ai;
   
const session = await RawSession.fromHuggingFace('Supabase/gte-small');
   
// Example only; in real 'feature-extraction' the tensors come from the tokenizer step.
// Consider 'n' as the batch size
const inputs = {
   input_ids: new RawTensor('float32', [1, 2, 3...], [n, 2]),
   attention_mask: new RawTensor('float32', [...], [n, 2]),
   // @ts-ignore: mixing Tensors from both libraries
   token_type_ids: new Tensor('float32', [...], [n, 2])
};
   
const { last_hidden_state } = await session.run(inputs);
   
// Using `transformers.js` APIs
const hfTensor = Tensor.mean_pooling(last_hidden_state, inputs.attention_mask).normalize();
   
return hfTensor.tolist();
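
For reference, a minimal sketch of that tokenizer step, assuming the AutoTokenizer API from @huggingface/transformers; the tokenizer output already contains the input_ids, attention_mask and token_type_ids tensors the model expects:

import { AutoTokenizer } from "@huggingface/transformers";
const { RawSession } = Supabase.ai;

const session = await RawSession.fromHuggingFace('Supabase/gte-small');
const tokenizer = await AutoTokenizer.from_pretrained('Supabase/gte-small');

// Tokenize a batch of sentences; padding/truncation keeps the dims consistent
const inputs = await tokenizer(['Hello world'], { padding: true, truncation: true });

const { last_hidden_state } = await session.run(inputs);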

Self-made models

This example illustrates how users can train their own model and execute it directly from edge-runtime.

Here you can check a deployable example of it, using the current Supabase stack.

The model was trained to expect the following object payload:

[
  {
    "Model_Year": 2021,
    "Engine_Size": 2.9,
    "Cylinders": 6,
    "Fuel_Consumption_in_City": 13.9,
    "Fuel_Consumption_in_City_Hwy": 10.3,
    "Fuel_Consumption_comb": 12.3,
    "Smog_Level": 3,
  },
  {
    "Model_Year": 2023,
    "Engine_Size": 2.4,
    "Cylinders": 4,
    "Fuel_Consumption_in_City": 9.9,
    "Fuel_Consumption_in_City_Hwy": 7.0,
    "Fuel_Consumption_comb": 8.6,
    "Smog_Level": 3,
  }
]

Then the model inference can be done inside a common Edge Function:

const { RawTensor, RawSession } = Supabase.ai;

// Custom filename on Hugging Face, default: 'model_quantized.onnx'
const session = await RawSession.fromHuggingFace('kallebysantos/vehicle-emission', {
  path: {
    modelFile: 'model.onnx',
  },
});

Deno.serve(async (req: Request) => {
  const carsBatchInput = await req.json();

  // Parsing objects to tensor input
  const inputTensors = {};
  session.inputs.forEach((inputKey) => {
    const values = carsBatchInput.map((item) => item[inputKey]);

    // This model uses `float32` tensors, but other models may use mixed types
    inputTensors[inputKey] = new RawTensor('float32', values, [values.length, 1]);
  });

  const { emissions } = await session.run(inputTensors);

  return Response.json({ result: emissions });  // [ 289.01, 199.53]
});
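
For completeness, a hypothetical client call for this function (the project URL and function name are placeholders, and carsBatch is the payload array shown above):

// Hypothetical client-side call; the project URL is a placeholder
const res = await fetch("https://PROJECT_REF.supabase.co/functions/v1/vehicle-emission", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(carsBatch), // the payload array shown above
});

console.log(await res.json()); // { result: [289.01, 199.53] }
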
Type definitions

These TypeScript definitions should be added to supabase/functions-js:

declare namespace Supabase {
  /**
   * Provides AI related APIs
   */
  export interface Ai {
    /** Provides a user-friendly interface for the low-level *onnx backend API*.
     * A `RawSession` can execute any *onnx* model, but we only recommend it for `tabular` or *self-made* models, where you need more control over model execution and pre/post-processing.
     * Consider a high-level implementation like `@huggingface/transformers.js` for generic tasks like `nlp`, `computer-vision` or `audio`.
     *
     * **Example:**
     * ```typescript
     * const session = await RawSession.fromHuggingFace('Supabase/gte-small');
     * // const session = await RawSession.fromUrl("https://example.com/model.onnx");
     *
     * // Prepare the input tensors
     * const inputs = {
     *   input1: new RawTensor("float32", [1.0, 2.0, 3.0], [3]),
     *   input2: new RawTensor("float32", [4.0, 5.0, 6.0], [3]),
     * };
     *
     * // Run the model
     * const outputs = await session.run(inputs);
     *
     * console.log(outputs.output1); // Output tensor
     * ```
     */
    readonly RawSession: typeof RawSession;

    /** A low-level representation of model input/output.
     * Supabase's `RawTensor` is fully compatible with `@huggingface/transformers.js`'s `Tensor`. It means that you can use its high-level API to apply common operations like `sum()`, `min()`, `max()`, `normalize()` etc.
     *
     * **Example: Generating embeddings from scratch**
     * ```typescript
     * import { Tensor } from "@huggingface/transformers";
     * const { RawTensor, RawSession } = Supabase.ai;
     *
     * const session = await RawSession.fromHuggingFace('Supabase/gte-small');
     *
     * // Example only; in real 'feature-extraction' the tensors come from the tokenizer step.
     * const inputs = {
     *    input_ids: new RawTensor('float32', [...], [n, 2]),
     *    attention_mask: new RawTensor('float32', [...], [n, 2]),
     *    token_type_ids: new Tensor('float32', [...], [n, 2]) // Hugging Face tensor
     * };
     *
     * const { last_hidden_state } = await session.run(inputs);
     *
     * // Using `transformers.js` APIs
     * const hfTensor = Tensor.mean_pooling(last_hidden_state, inputs.attention_mask).normalize();
     *
     * return hfTensor.tolist();
     *
     * ```
     */
    readonly RawTensor: typeof RawTensor;
  }

  /**
   * Provides AI related APIs
   */
  export const ai: Ai;

  export type TensorDataTypeMap = {
    float32: Float32Array | number[];
    float64: Float64Array | number[];
    string: string[];
    int8: Int8Array | number[];
    uint8: Uint8Array | number[];
    int16: Int16Array | number[];
    uint16: Uint16Array | number[];
    int32: Int32Array | number[];
    uint32: Uint32Array | number[];
    int64: BigInt64Array | number[];
    uint64: BigUint64Array | number[];
    bool: Uint8Array | number[];
  };

  export type TensorMap = { [key: string]: RawTensor<keyof TensorDataTypeMap> };

  export class RawTensor<T extends keyof TensorDataTypeMap> {
    /**  Type of the tensor. */
    type: T;

    /** The data stored in the tensor. */
    data: TensorDataTypeMap[T];

    /**  Dimensions of the tensor. */
    dims: number[];

    /** The total number of elements in the tensor. */
    size: number;

    constructor(type: T, data: TensorDataTypeMap[T], dims: number[]);
  }

  export class RawSession {
    /** The underlying session's ID.
     * Session IDs are unique for each loaded model, meaning that even if a session is constructed twice it will share the same ID.
     */
    id: string;

    /** A list of all input keys the model expects. */
    inputs: string[];

    /** A list of all output keys the model will return. */
    outputs: string[];

    /** Loads an ONNX model session from a source URL.
     * Sessions are loaded once, then kept warm across worker requests.
     */
    static fromUrl(source: string | URL): Promise<RawSession>;

    /** Loads an ONNX model session from a **Hugging Face** repository.
     * Sessions are loaded once, then kept warm across worker requests.
     */
    static fromHuggingFace(repoId: string, opts?: {
      /**
       * @default 'https://huggingface.co'
       */
      hostname?: string | URL;
      path?: {
        /**
         * @default '{REPO_ID}/resolve/{REVISION}/onnx/{MODEL_FILE}?download=true'
         */
        template?: string;
        /**
         * @default 'main'
         */
        revision?: string;
        /**
         * @default 'model_quantized.onnx'
         */
        modelFile?: string;
      };
    }): Promise<RawSession>;

    /** Run the current session with the given inputs.
     * Use `inputs` and `outputs` properties to know the required inputs and expected results for the model session.
     *
     * @param inputs The input tensors required by the model.
     * @returns The output tensors generated by the model.
     *
     * @example
     * ```typescript
     * const session = await RawSession.fromUrl("https://example.com/model.onnx");
     *
     * // Prepare the input tensors
     * const inputs = {
     *   input1: new RawTensor("float32", [1.0, 2.0, 3.0], [3]),
     *   input2: new RawTensor("float32", [4.0, 5.0, 6.0], [3]),
     * };
     *
     * // Run the model
     * const outputs = await session.run(inputs);
     *
     * console.log(outputs.output1); // Output tensor
     * ```
     */
    run(inputs: TensorMap): Promise<TensorMap>;
  }
}
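
For clarity, here is how the fromHuggingFace defaults above resolve into a download URL (illustrative only):

// Illustrative: RawSession.fromHuggingFace('Supabase/gte-small') with all defaults
//
//   hostname:  'https://huggingface.co'
//   template:  '{REPO_ID}/resolve/{REVISION}/onnx/{MODEL_FILE}?download=true'
//   revision:  'main'
//   modelFile: 'model_quantized.onnx'
//
// resolves to:
// https://huggingface.co/Supabase/gte-small/resolve/main/onnx/model_quantized.onnx?download=true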

Some ideas that could also be implemented:

  • Add Supabase Storage integration (see the sketch below)
  • Possibility to edit request headers for external authentication
  • Fine control of the Session ID
  • Model size constraints: check the size before downloading the model
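
The Storage integration idea could already be approximated today by pairing supabase-js signed URLs with RawSession.fromUrl; a minimal sketch, where the 'models' bucket and file name are hypothetical:

import { createClient } from "jsr:@supabase/supabase-js@2";
const { RawSession } = Supabase.ai;

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Hypothetical 'models' bucket holding a trained .onnx file
const { data, error } = await supabase.storage
  .from("models")
  .createSignedUrl("model.onnx", 60);
if (error) throw error;

const session = await RawSession.fromUrl(data.signedUrl);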
