-
@iseeyuan Do you think it would be possible to get to a point where we can meet model definitions where they are - even if we're maybe getting 50%-80% of theoretical peak performance? The reason I ask is that there is a significant burden involved in writing the model the way that ET expects. If we could provide a "peak performance path" and an "out of box / just works" path, that would be very nice. My experience with some internal teams is that they try to run existing language models on ET and drop it when the out of box performance is far behind what they expect. Aside from that, I think standardizing the language model APIs will be a big win for usability. Thanks for putting this together.
-
meta comment: you need to format the RFC a bit. Indentation and spacing are off enough that it makes for a hard read.
-
@iseeyuan I feel that you should split this RFC into two. One for model architecture definition and one for export_llama_lib refactoring.
-
To what extent is this true? If someone doesn't have permission to rewrite the modeling code, does that mean the model won't work for that backend at all? Or will it still work but just not achieve the best performance? Maybe @kimishpatel @digantdesai @cccclai can comment about it?
I agree that we should provide tools to make source code rewriting easier. However, I don’t think rewriting should always be done in the original modeling code, as this could impact dozens of models and make OSS contributions increasingly difficult as more models are covered. (This is how I interpret the proposal, given its goal of unifying code and reducing boilerplate.) For example, as a user, if the code I modify could affect the performance of numerous models and use cases, I’d be hesitant to make changes and would likely defer to ET developers instead. This not only places us at the center of enablement and improvement but also increases the risk of making contributions more intimidating.
Many of the proposed ideas already exist in HF. For example, the ability to add and register different attention implementations is already supported (pointer). Additionally, the lifted cache is already exported as IO in Exported IR (example). My impression is that this proposal is leaning toward consolidating HF Transformers' definitions and owning them in our repo, aiming to support as many transformer models as possible—including text, audio, and vision transformers. Can this approach scale effectively? One of the core principles HF Transformers upholds is "single model, single file" (as mentioned at PTC 2024). I believe they are fully aware of the downside of this approach—namely, redundant code—but it provides significant flexibility in isolating performance impacts across models and reduces the complexity of performance testing. So far, this strategy has proven highly successful.
I want to second this. Some ML engineers who just want to prototype quickly in Python shouldn’t need to be aware of the runtime code (C++). Take HF workflow as an example—good UX means an ML engineer should be able to experiment with different models and recipes, validating end-to-end in Python without needing any knowledge of the underlying runtime. This requires the interface to runtime(s) to be not only backend-agnostic but also model-agnostic.
Back to the key problem highlighted in this proposal, having multiple modeling sources in our repo is indeed a challenge, but is having multiple modeling sources itself a problem? I see these as two distinct issues, and the latter doesn’t seem avoidable—it will happen somewhere regardless.
You mean decoder-only transformers, right? What about encoder-only transformers (like BERT) and encoder-decoder transformers (like T5)? What's the plan for non-transformer models, such as diffusion models or Timm models? If we're heading down this path, I think we need to consider the full picture.
Q: Should the ExecuTorch repo serve as a recipe repository? If so, how many recipes do you expect to host in the ExecuTorch repo? This proposal seems to imply that the ExecuTorch repo will also function as a recipe repository. I agree that providing a default recipe for each backend makes sense. However, that alone doesn't justify the need to host these recipes within ExecuTorch. Some of the proposed ideas, such as controlling recipes via a configuration file, are already well-supported by Hugging Face not just for eager, but also for ONNX and TFLite. Why is it necessary to rebuild a similar mechanism and maintain it in our repo?
From the perspective of building a vibrant community, I think it is key that recipes are separated from the core. While we can offer a default recipe for each backend as an option, we shouldn't restrict users to copying and customizing them for their own needs. To encourage organic community growth, users should be able to create as many recipes as they want and make recipes shareable so that other OSS users can benefit. This level of openness wouldn't be possible if recipes were tightly coupled in our repo.
-
tl;dr
When supporting internal and open source projects, we realized that there are a number of transformer definitions, export flows and runtimes existing in both OSS and internal repos.
It may cause some confusion and complications for both internal users and the OSS community.
With more use cases (multi-turn, multimodal, etc.) and more models being supported, the number of those versions may grow and introduce friction in deployment.
This discussion is to collect all the versions currently used by ExecuTorch, analyze the differences behind them, and share some design thoughts and questions.
Context
Popular LLMs share similar transformer-based architectures. The fixed architecture brings some convenience for deployment. An example is llama.cpp: the framework is built around the architecture and sets contracts on loading and running the weight files. However, when deploying to a variety of backends, the flows can differ due to the backends' different limitations. Those limitations include:
Static shapes vs. dynamic shapes
Static quantization vs. dynamic quantization
Data types a backend supports
Kernels available in a backend
Different types of attention layers
Sometimes, updating the export recipe is not sufficient or efficient. Supporting a specific backend may involve a different copy of the model definition and a different version of the runtime code. On the other hand, we can see a potential trend of scale due to more use cases and models to be supported. Adding up those new models, and multiplying them by the number of backends and use cases to support, the number of versions can explode.
Below is a table to summarize those existing code versions, their unique properties and use cases.
RFC
With a limited number of models and backends, it may be faster to make a copy of a model and backend, and develop based on that copy. However, when the numbers scale, there can be some redundant work.
Instead, with a cleaner structure, a lot of time-consuming work can be saved and boilerplate code can be avoided.
The above diagrams are for llama-like models. Some notes:
It can be extended to general transformer-like models, with the ET-friendly llama.py as the single entry point and different implementations as necessary. For example, with this single entry point, there can be different attention implementations with a simple registration (a minimal sketch is included after these notes). Please refer to the Eager mode definition RFC below.
When we see a clear and generic trend, the code can be moved from the examples folder to extension/llm.
With the improvements of export capabilities and backend support, ideally an arbitrary eager-mode definition could be exported and lowered to any backend. However, we don't see that being feasible in the near future.
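To make the registration idea above concrete, here is a minimal sketch of what an attention registry could look like; the names (ATTENTION_REGISTRY, register_attention, build_attention, and the two example classes) are illustrative assumptions, not the actual ExecuTorch API:

```python
# Minimal sketch of an attention registry; names are illustrative assumptions.
from typing import Callable, Dict, Type

import torch.nn as nn

ATTENTION_REGISTRY: Dict[str, Type[nn.Module]] = {}


def register_attention(name: str) -> Callable[[Type[nn.Module]], Type[nn.Module]]:
    """Class decorator that makes an attention implementation discoverable by name."""
    def wrapper(cls: Type[nn.Module]) -> Type[nn.Module]:
        ATTENTION_REGISTRY[name] = cls
        return cls
    return wrapper


@register_attention("mha_kv_attribute")
class MHAWithCacheAttribute(nn.Module):
    """Multi-head attention holding its KV cache as a module attribute."""
    ...


@register_attention("mha_kv_io")
class MHAWithCacheIO(nn.Module):
    """Multi-head attention taking/returning the KV cache as inputs/outputs."""
    ...


def build_attention(attention_type: str, **kwargs) -> nn.Module:
    # Single entry point: the transformer block only knows the registry key.
    return ATTENTION_REGISTRY[attention_type](**kwargs)
```

With something like this, adding MLA or a backend-specific attention becomes a new registered class rather than a fork of the model file.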
Eager mode definition
[RFC1] Provide tools to help with source code rewriting. Options are listed below:
Source-level transforms. Good for code unification; not straightforward for readability. Cannot be used in all situations (e.g., across different APIs).
Weight mapping. For example, using torchtune utils to convert HF safetensor weights to a PyTorch checkpoint. Example here. (A rough sketch is included after this list.)
Convert the configs (Qwen, DeepSeek), or provide easy ways to build those models using the existing components.
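As a rough illustration of the weight-mapping option, the sketch below renames HF-style checkpoint keys into an ET/torchtune-style layout; the key patterns here are hypothetical examples and will differ per model:

```python
# Rough sketch of weight mapping from HF-style keys to an ET-friendly checkpoint;
# the key patterns below are hypothetical, not the real mapping for any model.
import re
from typing import Dict

import torch

_HF_TO_ET_KEY_PATTERNS = {
    r"^model\.embed_tokens\.weight$": "tok_embeddings.weight",
    r"^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$": r"layers.\1.attention.wq.weight",
    r"^model\.layers\.(\d+)\.self_attn\.k_proj\.weight$": r"layers.\1.attention.wk.weight",
    r"^lm_head\.weight$": "output.weight",
}


def convert_hf_state_dict(hf_state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Rename HF checkpoint keys so they load into the ET-friendly definition."""
    converted = {}
    for hf_key, tensor in hf_state_dict.items():
        for pattern, replacement in _HF_TO_ET_KEY_PATTERNS.items():
            if re.match(pattern, hf_key):
                converted[re.sub(pattern, replacement, hf_key)] = tensor
                break
        else:
            converted[hf_key] = tensor  # pass through keys we don't remap
    return converted
```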
There are two layers of rewrite:
For general purposes or exportability. For example:
We may wrap an unbacked int in a tensor.
Rewrite slicing to make it exportable (https://github.com/pytorch/pytorch/issues/120288).
For a specific backend. For example, KV Cache as IO for QNN and CoreML.
For the first layer, we should unify and reuse the same code to avoid redundant work. An illustrative sketch of this kind of rewrite follows below.
[RFC2] For the second layer of rewrite (backend specific), we are seeking code sharing to avoid redundant work.
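As an assumed example of the first layer (exportability) rewrite, the sketch below replaces Python-int slicing in a KV-cache update with tensor indexing; the function names and shapes are illustrative, not the exact ExecuTorch rewrite:

```python
# Illustrative "exportability" rewrite: avoid data-dependent Python ints and
# slice assignment so the graph stays traceable. Shapes/names are assumptions.
import torch


def update_cache_eager(cache: torch.Tensor, new_kv: torch.Tensor, input_pos: torch.Tensor):
    # Pulls a Python int out of a tensor and uses slice assignment, which
    # tends to break export with data-dependent positions.
    start = int(input_pos.item())
    cache[:, start : start + new_kv.shape[1]] = new_kv
    return cache


def update_cache_exportable(cache: torch.Tensor, new_kv: torch.Tensor, input_pos: torch.Tensor):
    # Keep the position as a tensor and use index_copy instead of Python-int
    # slicing, so the update is expressed purely in tensor ops.
    seq_len = new_kv.shape[1]
    positions = input_pos + torch.arange(seq_len, device=cache.device)
    return cache.index_copy(1, positions, new_kv)
```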
The attention layer is where the biggest variety across different Transformers comes from. For example, KV cache as IO vs. as an attribute. KV cache as IO is compatible with all other backends as well, and may make unified KV cache management code in the runtime easier. Currently on CPU we are using mutable state, as shown in Jacob's post. NPUs are using KV cache as IO (static_llama and reference_llama). Some options to support this feature (a sketch contrasting the two cache styles is included after the options):
[Preferred] Most differences happen at the attention layer. Add an attention interface. Different attention implementations can be registered. Prototype PR
Pros: readability in eager mode, consistent with runtime logic. Other types of attention can be added and registered, for example the Multi-head Latent Attention (MLA) used in DeepSeek. Users can easily find all supported attention types.
Cons: one more option in multiple functions (transformer, block, and attention); performance for large caches?
Merge all definitions to KV cache as IO, and keep a separate copy for KV cache as attribute.
Keep the lifted KV cache as IO in export IR
Pros: no eager mode change on this
Cons: hidden from the user; no direct user access (customized cache management usually comes from eager mode)
Use input_pos optionally. Omitting input_pos is not just about backend limitations. For one-shot use cases like classification (arbitration), or potentially multimodal encoders, input_pos may not be necessary.
Static/dynamic KV cache. A dynamic KV cache can be used to save average memory but not peak memory. Since there are other ways, like MLA, to save KV cache size, a dynamic KV cache would be a lower priority for now.
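For reference, here is a hedged sketch contrasting the two cache styles discussed above; class and argument names are illustrative, not the exact static_llama or llama_transformer code:

```python
# Hedged sketch of the two KV-cache styles; names and shapes are assumptions.
from typing import Tuple

import torch
import torch.nn as nn


class AttentionCacheAsAttribute(nn.Module):
    """KV cache lives inside the module as mutable buffers (CPU-style)."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int):
        super().__init__()
        self.register_buffer("k_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))
        self.register_buffer("v_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))

    def forward(self, q, k, v, input_pos):
        # Update the internal buffers in place, then attend over them.
        self.k_cache.index_copy_(2, input_pos, k)
        self.v_cache.index_copy_(2, input_pos, v)
        return torch.nn.functional.scaled_dot_product_attention(q, self.k_cache, self.v_cache)


class AttentionCacheAsIO(nn.Module):
    """KV cache is passed in and returned explicitly (NPU-style)."""

    def forward(self, q, k, v, k_cache, v_cache, input_pos) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        k_cache = k_cache.index_copy(2, input_pos, k)
        v_cache = v_cache.index_copy(2, input_pos, v)
        out = torch.nn.functional.scaled_dot_product_attention(q, k_cache, v_cache)
        # The caller (or runtime) feeds the returned caches back in on the next call.
        return out, k_cache, v_cache
```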
Open questions
How is QAT handled?
Users may want to do QAT on torchtune models since the infra is set up there.
If it's eager-mode QAT (weight only), we can apply a transform to the QAT submodule.
PT2E QAT has to be done on the ET definition. Should we set up the QAT flow based on the ET transformer? (A sketch of the PT2E QAT flow is included below.)
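A sketch of what the PT2E QAT flow on the ET definition could look like is below; the exact capture and quantizer entry points depend on the PyTorch/ExecuTorch versions, so treat the imports as assumptions to verify:

```python
# Hedged sketch of a PT2E QAT flow; API locations may differ across versions.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_qat_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


def pt2e_qat(model: torch.nn.Module, example_inputs, train_fn):
    # Capture a pre-autograd graph (the capture API name varies by PyTorch version).
    captured = torch.export.export_for_training(model, example_inputs).module()

    quantizer = XNNPACKQuantizer().set_global(
        get_symmetric_quantization_config(is_qat=True)
    )
    prepared = prepare_qat_pt2e(captured, quantizer)

    train_fn(prepared)  # fine-tune with fake-quant observers in the graph

    return convert_pt2e(prepared)  # quantized graph, ready for to_edge/lowering
```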
Export flows
export_llama looks over-complicated, with all the command-line options to handle different quantization schemes, different backends, etc. Inspired by our internal ModAI tool, as well as torchtune's configuration structure based on Hydra:
[RFC] Have one configuration file/recipe for each backend (a hypothetical sketch of such a recipe is included after this list).
What's the format to host this recipe: a Python script or a YAML file? A config file may have the advantage of simplicity (users don't have to know the implementation details) and better version control, but may require more effort to maintain.
What's the granularity of the recipes? If all configs are decoupled from the implementation, it may be more reasonable to have one implementation, like export_llama_cpu, but multiple config YAMLs for each target use case, like different quantization group sizes.
Modularize the code (checkpoint loading, quantization, etc.).
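To make the recipe idea concrete, below is a hypothetical per-backend recipe and a tiny loader; every field name is an assumption for illustration, not the actual export_llama config schema (PyYAML is assumed for parsing):

```python
# Hypothetical per-backend recipe; field names are illustrative assumptions.
from dataclasses import dataclass, field

import yaml  # PyYAML, assumed available

EXAMPLE_RECIPE_YAML = """
backend: xnnpack
checkpoint: /path/to/consolidated.00.pth
params: /path/to/params.json
quantization:
  mode: 8da4w
  group_size: 128
kv_cache:
  style: attribute      # or "io" for backends that need the cache as model IO
  max_seq_len: 2048
dtype: fp32
"""


@dataclass
class ExportRecipe:
    backend: str
    checkpoint: str
    params: str
    quantization: dict = field(default_factory=dict)
    kv_cache: dict = field(default_factory=dict)
    dtype: str = "fp32"


def load_recipe(yaml_text: str) -> ExportRecipe:
    return ExportRecipe(**yaml.safe_load(yaml_text))


recipe = load_recipe(EXAMPLE_RECIPE_YAML)
```

One implementation per backend plus several such YAML files per use case would keep the CLI surface small while still covering the quantization/cache variations.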
Runtime
Runtime code may need more simplification and unification, due to the complexity of maintaining and building C++ code.
[RFC] Runtime code should be as backend-agnostic as possible. Some features should be modularized and provided as a library.
Runtime code should be simple. The complex logic should be put into the model if possible, for the reasons below:
Scalability: C++ code is reusable through operators. We don't need to maintain multiple C++ files for multiple models.
Portability: no need to sync C++ files across two repos during development (like ExecuTorch and torchchat).
Better UX: it's easier for users to integrate model inference into their use cases, and it's less error prone.
There is a strong need for runtime components. For example,
KV cache management. We should modularize these components and provide them in a library.
When the user logic gets more complicated, a local data container to efficiently and safely store/retrieve data would be necessary. Our existing MLDW may help here.
cc @mergennachin @kimishpatel @tarun292 @jackzhxng @cccclai @larryliu0820 @guangy10 @billmguo @sxu @Andriyluck @madhu-fb