-
@iseeyuan Do you think it would be possible to get to a point where we can meet model definitions where they are - even if we're maybe getting 50%-80% of theoretical peak performance? The reason I ask is that there is a significant burden involved in writing the model the way that ET expects. If we could provide a "peak performance path" and an "out of box / just works" path, that would be very nice. My experience with some internal teams is that they try to run existing language models on ET and drop it when the out of box performance is far behind what they expect. Aside from that, I think standardizing the language model APIs will be a big win for usability. Thanks for putting this together.
-
meta comment: you need to format the RFC a bit. Indentation and spacing are off enough that it makes for a hard read.
-
@iseeyuan I feel that you should split this RFC into two. One for model architecture definition and one for export_llama_lib refactoring.
-
To what extent is this true? If someone doesn't have permission to rewrite the modeling code, does that mean the model won't work for that backend at all? Or will it still work but just not achieve the best performance? Maybe @kimishpatel @digantdesai @cccclai can comment about it?
I agree that we should provide tools to make source code rewriting easier. However, I don’t think rewriting should always be done in the original modeling code, as this could impact dozens of models and make OSS contributions increasingly difficult as more models are covered. (This is how I interpret the proposal, given its goal of unifying code and reducing boilerplate.) For example, as a user, if the code I modify could affect the performance of numerous models and use cases, I’d be hesitant to make changes and would likely defer to ET developers instead. This not only places us at the center of enablement and improvement but also increases the risk of making contributions more intimidating.
Many of the proposed ideas already exist in HF. For example, the ability to add and register different attention implementations is already supported (pointer). Additionally, the lifted cache is already exported as IO in Exported IR (example). My impression is that this proposal is leaning toward consolidating HF Transformers' definitions and owning them in our repo, aiming to support as many transformer models as possible—including text, audio, and vision transformers. Can this approach scale effectively? One of the core principles HF Transformers upholds is "single model, single file" (as mentioned at PTC 2024). I believe they are fully aware of the downside of this approach—namely, redundant code—but it provides significant flexibility in isolating performance impacts across models and reduces the complexity of performance testing. So far, this strategy has proven highly successful.
I want to second this. Some ML engineers who just want to prototype quickly in Python shouldn’t need to be aware of the runtime code (C++). Take HF workflow as an example—good UX means an ML engineer should be able to experiment with different models and recipes, validating end-to-end in Python without needing any knowledge of the underlying runtime. This requires the interface to runtime(s) to be not only backend-agnostic but also model-agnostic.
Back to the key problem highlighted in this proposal, having multiple modeling sources in our repo is indeed a challenge, but is having multiple modeling sources itself a problem? I see these as two distinct issues, and the latter doesn’t seem avoidable—it will happen somewhere regardless.
You mean decoder-only transformers, right? What about encoder-only transformers (like BERT) and encoder-decoder transformers (like T5)? What's the plan for non-transformer models, such as diffusion models or Timm models? If we're heading down this path, I think we need to consider the full picture.
Q: Should the ExecuTorch repo serve as a recipe repository? If so, how many recipes do you expect to host in the ExecuTorch repo? This proposal seems to imply that the ExecuTorch repo will also function as a recipe repository. I agree that providing a default recipe for each backend makes sense. However, that alone doesn't justify the need to host these recipes within ExecuTorch. Some of the proposed ideas, such as controlling recipes via a configuration file, are already well-supported by Hugging Face not just for eager, but also for ONNX and TFLite. Why is it necessary to rebuild a similar mechanism and maintain it in our repo?
From the perspective of building a vibrant community, I think it is key that recipes are separated from the core. While we can offer a default recipe for each backend as an option, we shouldn't restrict users to copying and customizing them for their own needs. To encourage organic community growth, users should be able to create as many recipes as they want and make recipes shareable so that other OSS users can benefit. This level of openness wouldn't be possible if recipes were tightly coupled in our repo.
-
tl;dr
When supporting internal and open source projects, we realized that there are a number of transformer definitions, export flows and runtimes existing in both OSS and internal repos.
It may cause some confusion and complications for both internal users and the OSS community.
With more use cases (multi-turn, multimodal, etc.) and more models being supported, the number of those versions may grow and introduce friction in deployment.
This discussion is to collect all the versions currently used by ExecuTorch, analyze the differences behind them, and share some design thoughts and questions.
Context
Popular LLMs share similar transformer-based architectures. The fixed architecture brings some convenience for deployment. An example is llama.cpp: the framework is built around the architecture and sets contracts on loading and running the weight files. However, when deploying to a variety of backends, the flows can differ due to the backends' different limitations. Those limitations include:
Static shapes vs. dynamic shapes
Static quantization vs. dynamic quantization
Data types a backend supports
Kernels available in a backend
Different types of attention layers
Sometimes, updating the export recipe is not sufficient or efficient. Supporting a specific backend may involve a different copy of the model definition and a different version of the runtime code. On the other hand, we can see a potential trend of scale due to more use cases and models to be supported. Adding up those new models, and multiplying them by the number of backends and use cases to support, the number of versions can explode.
Below is a table to summarize those existing code versions, their unique properties and use cases.
RFC
With a limited number of models and backends, it may be faster to make a copy of a model and backend, and develop based on that copy. However, when the numbers scale, there can be some redundant work.
Instead, with a cleaner structure, a lot of time-consuming work can be saved and boilerplate code can be avoided.
The above diagrams are for llama-like models. Some notes:
It can be extended to general transformer-like models, with the ET-friendly llama.py as the single entry point and different implementations as necessary. For example, with this single entry point, there can be different attention implementations with a simple registration (a minimal sketch is included after these notes). Please refer to the Eager mode definition RFC below.
When we see a clear and generic trend, the code can be moved from the examples folder to extension/llm.
With the improvements of export capabilities and backend support, ideally an arbitrary eager-mode definition could be exported and lowered to any backend. However, we don't see that being feasible in the near future.
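To make the registration idea above concrete, here is a minimal sketch of what an attention registry could look like; the names (ATTENTION_REGISTRY, register_attention, build_attention, and the two example classes) are illustrative assumptions, not the actual ExecuTorch API:

```python
# Minimal sketch of an attention registry; names are illustrative assumptions.
from typing import Callable, Dict, Type

import torch.nn as nn

ATTENTION_REGISTRY: Dict[str, Type[nn.Module]] = {}


def register_attention(name: str) -> Callable[[Type[nn.Module]], Type[nn.Module]]:
    """Class decorator that makes an attention implementation discoverable by name."""
    def wrapper(cls: Type[nn.Module]) -> Type[nn.Module]:
        ATTENTION_REGISTRY[name] = cls
        return cls
    return wrapper


@register_attention("mha_kv_attribute")
class MHAWithCacheAttribute(nn.Module):
    """Multi-head attention holding its KV cache as a module attribute."""
    ...


@register_attention("mha_kv_io")
class MHAWithCacheIO(nn.Module):
    """Multi-head attention taking/returning the KV cache as inputs/outputs."""
    ...


def build_attention(attention_type: str, **kwargs) -> nn.Module:
    # Single entry point: the transformer block only knows the registry key.
    return ATTENTION_REGISTRY[attention_type](**kwargs)
```

With something like this, adding MLA or a backend-specific attention becomes a new registered class rather than a fork of the model file.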
Eager mode definition
[RFC1] Provide tools to help with source code rewriting. Options are listed below:
Source-level transforms. Good for code unification; not straightforward for readability. Cannot be used in all situations (e.g., across different APIs).
Weight mapping. For example, using torchtune utils to convert HF safetensor weights to a PyTorch checkpoint. Example here. (A rough sketch is included after this list.)
Convert the configs (Qwen, DeepSeek), or provide easy ways to build those models using the existing components.
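As a rough illustration of the weight-mapping option, the sketch below renames HF-style checkpoint keys into an ET/torchtune-style layout; the key patterns here are hypothetical examples and will differ per model:

```python
# Rough sketch of weight mapping from HF-style keys to an ET-friendly checkpoint;
# the key patterns below are hypothetical, not the real mapping for any model.
import re
from typing import Dict

import torch

_HF_TO_ET_KEY_PATTERNS = {
    r"^model\.embed_tokens\.weight$": "tok_embeddings.weight",
    r"^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$": r"layers.\1.attention.wq.weight",
    r"^model\.layers\.(\d+)\.self_attn\.k_proj\.weight$": r"layers.\1.attention.wk.weight",
    r"^lm_head\.weight$": "output.weight",
}


def convert_hf_state_dict(hf_state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Rename HF checkpoint keys so they load into the ET-friendly definition."""
    converted = {}
    for hf_key, tensor in hf_state_dict.items():
        for pattern, replacement in _HF_TO_ET_KEY_PATTERNS.items():
            if re.match(pattern, hf_key):
                converted[re.sub(pattern, replacement, hf_key)] = tensor
                break
        else:
            converted[hf_key] = tensor  # pass through keys we don't remap
    return converted
```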
There are two layers of rewrite:
For general purposes or exportability. For example:
We may wrap an unbacked int in a tensor.
Rewrite slicing to make it exportable (https://github.com/pytorch/pytorch/issues/120288).
For a specific backend. For example, KV Cache as IO for QNN and CoreML.
For the first layer, we should unify and reuse the same code to avoid redundant work. An illustrative sketch of this kind of rewrite follows below.
[RFC2] For the second layer of rewrite (backend specific), we are seeking code sharing to avoid redundant work.
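As an assumed example of the first layer (exportability) rewrite, the sketch below replaces Python-int slicing in a KV-cache update with tensor indexing; the function names and shapes are illustrative, not the exact ExecuTorch rewrite:

```python
# Illustrative "exportability" rewrite: avoid data-dependent Python ints and
# slice assignment so the graph stays traceable. Shapes/names are assumptions.
import torch


def update_cache_eager(cache: torch.Tensor, new_kv: torch.Tensor, input_pos: torch.Tensor):
    # Pulls a Python int out of a tensor and uses slice assignment, which
    # tends to break export with data-dependent positions.
    start = int(input_pos.item())
    cache[:, start : start + new_kv.shape[1]] = new_kv
    return cache


def update_cache_exportable(cache: torch.Tensor, new_kv: torch.Tensor, input_pos: torch.Tensor):
    # Keep the position as a tensor and use index_copy instead of Python-int
    # slicing, so the update is expressed purely in tensor ops.
    seq_len = new_kv.shape[1]
    positions = input_pos + torch.arange(seq_len, device=cache.device)
    return cache.index_copy(1, positions, new_kv)
```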
The attention layer is where the biggest variety across different Transformers comes from. For example, KV cache as IO vs. as an attribute. KV cache as IO is compatible with all other backends as well, and may make unified KV cache management code in the runtime easier. Currently on CPU we are using mutable state, as shown in Jacob's post. NPUs are using KV cache as IO (static_llama and reference_llama). Some options to support this feature (a sketch contrasting the two cache styles is included after the options):
[Preferred] Most differences happen at the attention layer. Add an attention interface. Different attention implementations can be registered. Prototype PR
Pros: readability in eager mode, consistent with runtime logic. Other types of attention can be added and registered, for example the Multi-head Latent Attention (MLA) used in DeepSeek. Users can easily find all supported attention types.
Cons: one more option in multiple functions (transformer, block, and attention); performance for large caches?
Merge all definitions to KV cache as IO, and keep a separate copy for KV cache as attribute.
Keep the lifted KV cache as IO in export IR
Pros: no eager mode change on this
Cons: hidden from the user; no direct user access (customized cache management usually comes from eager mode)
Use input_pos optionally. Omitting input_pos is not just about backend limitations. For one-shot use cases like classification (arbitration), or potentially multimodal encoders, input_pos may not be necessary.
Static/dynamic KV cache. A dynamic KV cache can be used to save average memory but not peak memory. Since there are other ways, like MLA, to save KV cache size, a dynamic KV cache would be a lower priority for now.
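For reference, here is a hedged sketch contrasting the two cache styles discussed above; class and argument names are illustrative, not the exact static_llama or llama_transformer code:

```python
# Hedged sketch of the two KV-cache styles; names and shapes are assumptions.
from typing import Tuple

import torch
import torch.nn as nn


class AttentionCacheAsAttribute(nn.Module):
    """KV cache lives inside the module as mutable buffers (CPU-style)."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int):
        super().__init__()
        self.register_buffer("k_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))
        self.register_buffer("v_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))

    def forward(self, q, k, v, input_pos):
        # Update the internal buffers in place, then attend over them.
        self.k_cache.index_copy_(2, input_pos, k)
        self.v_cache.index_copy_(2, input_pos, v)
        return torch.nn.functional.scaled_dot_product_attention(q, self.k_cache, self.v_cache)


class AttentionCacheAsIO(nn.Module):
    """KV cache is passed in and returned explicitly (NPU-style)."""

    def forward(self, q, k, v, k_cache, v_cache, input_pos) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        k_cache = k_cache.index_copy(2, input_pos, k)
        v_cache = v_cache.index_copy(2, input_pos, v)
        out = torch.nn.functional.scaled_dot_product_attention(q, k_cache, v_cache)
        # The caller (or runtime) feeds the returned caches back in on the next call.
        return out, k_cache, v_cache
```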
Open questions
How is QAT handled?
Users may want to do QAT on torchtune models since the infra is set up there.
If it's eager-mode QAT (weight only), we can apply a transform to the QAT submodule.
PT2E QAT has to be done on the ET definition. Should we set up the QAT flow based on the ET transformer? (A sketch of the PT2E QAT flow is included below.)
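A sketch of what the PT2E QAT flow on the ET definition could look like is below; the exact capture and quantizer entry points depend on the PyTorch/ExecuTorch versions, so treat the imports as assumptions to verify:

```python
# Hedged sketch of a PT2E QAT flow; API locations may differ across versions.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_qat_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


def pt2e_qat(model: torch.nn.Module, example_inputs, train_fn):
    # Capture a pre-autograd graph (the capture API name varies by PyTorch version).
    captured = torch.export.export_for_training(model, example_inputs).module()

    quantizer = XNNPACKQuantizer().set_global(
        get_symmetric_quantization_config(is_qat=True)
    )
    prepared = prepare_qat_pt2e(captured, quantizer)

    train_fn(prepared)  # fine-tune with fake-quant observers in the graph

    return convert_pt2e(prepared)  # quantized graph, ready for to_edge/lowering
```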
Export flows
export_llama looks over-complicated, with all the command-line options to handle different quantization schemes, different backends, etc. Inspired by our internal ModAI tool, as well as torchtune's configuration structure based on Hydra:
[RFC] Have one configuration file/recipe for each backend (a hypothetical sketch of such a recipe is included after this list).
What's the format to host this recipe: a Python script or a YAML file? A config file may have the advantage of simplicity (users don't have to know the implementation details) and better version control, but may require more effort to maintain.
What's the granularity of the recipes? If all configs are decoupled from the implementation, it may be more reasonable to have one implementation, like export_llama_cpu, but multiple config YAMLs for each target use case, like different quantization group sizes.
Modularize the code (checkpoint loading, quantization, etc.).
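To make the recipe idea concrete, below is a hypothetical per-backend recipe and a tiny loader; every field name is an assumption for illustration, not the actual export_llama config schema (PyYAML is assumed for parsing):

```python
# Hypothetical per-backend recipe; field names are illustrative assumptions.
from dataclasses import dataclass, field

import yaml  # PyYAML, assumed available

EXAMPLE_RECIPE_YAML = """
backend: xnnpack
checkpoint: /path/to/consolidated.00.pth
params: /path/to/params.json
quantization:
  mode: 8da4w
  group_size: 128
kv_cache:
  style: attribute      # or "io" for backends that need the cache as model IO
  max_seq_len: 2048
dtype: fp32
"""


@dataclass
class ExportRecipe:
    backend: str
    checkpoint: str
    params: str
    quantization: dict = field(default_factory=dict)
    kv_cache: dict = field(default_factory=dict)
    dtype: str = "fp32"


def load_recipe(yaml_text: str) -> ExportRecipe:
    return ExportRecipe(**yaml.safe_load(yaml_text))


recipe = load_recipe(EXAMPLE_RECIPE_YAML)
```

One implementation per backend plus several such YAML files per use case would keep the CLI surface small while still covering the quantization/cache variations.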
Runtime
Runtime code may need more simplification and unification, due to the complexity of maintaining and building C++ code.
[RFC] Runtime code should be as backend-agnostic as possible. Some features should be modularized and provided as a library.
Runtime code should be simple. The complex logic should be put into the model if possible, for the reasons below:
Scalability: C++ code is reusable through operators. We don't need to maintain multiple C++ files for multiple models.
Portability: no need to sync C++ files across two repos during development (like ExecuTorch and torchchat).
Better UX: it's easier for users to integrate model inference into their use cases, and it's less error prone.
There is a strong need for runtime components. For example,
KV cache management. We should modularize these components and provide them in a library.
When the user logic gets more complicated, a local data container to efficiently and safely store/retrieve data would be necessary. Our existing MLDW may help here.
cc @mergennachin @kimishpatel @tarun292 @jackzhxng @cccclai @larryliu0820 @guangy10 @billmguo @sxu @Andriyluck @madhu-fb