LM Toolkit Refactor #381
base: 2.0.0
Conversation
TODO: finish processing script, integrate LLM
…ed in that package now
… textpredict package
Merging changes from removing unigram model
Merging unigram removal into toolkit refactor
I'll wait for @lawhead to review since he has more experience here. I would keep the base class as LanguageModel or BciPyLanguageModel, but the others could be annotated as Adapters extending from that. We may want our own Uniform here without an adapter. I understand why you need it in the toolkit, but it's simple enough to keep here, and it could be a good example of how to build an LM in BciPy.
The toolkit doesn't seem to work for Python 3.10.6. Should the requirement be >=3.7,<3.11?
Also, there are some linting errors!
Thanks for all of your effort on this PR. I like that it moves the details important for developing and evaluating language models into a separate space that can evolve independently from BciPy. However, there are a few things I would like to see implemented differently. Some of this feedback is detailed, so let me know if you want to set up a meeting to discuss.
1. Language models have always been an important component of BciPy and we need to retain that priority while more formally specifying an API. While the textpredict library provides most of the language models used in BciPy, it does not provide all of them (e.g. the Oracle LM). We also need to leave open the potential for users to bring their own models. So we still need to retain a LanguageModel class to specify what is required of a language model for use in BciPy.
Now that we are using Python 3.8+, we have a few more options than when this code was originally written. Rather than tightly coupling this with textpredict and using the LanguageModel ABC class in that library, I propose creating a LanguageModel Protocol (https://peps.python.org/pep-0544) in BciPy. The textpredict library can still maintain its own LanguageModel base class, and all models would implicitly implement this protocol (see the structural-subtyping sketch after this list):
from typing import Any, Dict, List, Literal, Protocol, Tuple

class LanguageModel(Protocol):
    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        ...

    def configure(self, params: Dict[str, Any]) -> None:
        """Configure the language model. Assumes a no-arg constructor.
        See below regarding parameters."""
        ...

    # Tasks don't use word prediction yet, so maybe it's still optional
    # and not included in the protocol.
    def set_response_type(self, response_type: Literal['symbol', 'word']):
        ...
2. Many of the adapters have similar code for handling spaces and backspaces. Is it possible to have a single adapter for all textpredict models?
3. Regarding language model parameters, I would prefer to establish a different mechanism for passing parameters to language models than using lm_params, which seems specific to what's currently in textpredict and may easily get out of sync. Maybe we have another value in parameters.json for this that is a serialized JSON string:

"lm_params": {
    "value": "{}",
    "section": "lang_model_config",
    "name": "Language Model Parameters",
    "helpTip": "Parameters passed to the selected language model.",
    "recommended": [],
    "editable": false,
    "type": "str"
}
4. The language_model helper currently depends on importing LanguageModel subclasses to know what's available. This mechanism should be re-worked to be a registry allowing other models to be included. This is a lower priority and can be pushed to a subsequent ticket.
5. I agree with Tab that BciPy should have its own Uniform LM.
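To illustrate the structural-subtyping point in 1.: under the proposed Protocol, a model never inherits from LanguageModel; implementing the methods is enough for type checkers to accept it. A minimal sketch, assuming the Protocol lives in bcipy/language/main.py (the UniformLanguageModel and best_symbol helper here are illustrative, not code from this PR):

from typing import Any, Dict, List, Literal, Tuple

# Assumed import location, based on the file changed later in this PR.
from bcipy.language.main import LanguageModel


class UniformLanguageModel:
    """Assigns equal probability to every symbol. Note: no subclassing of
    LanguageModel; matching the method signatures satisfies the Protocol."""

    def __init__(self) -> None:
        self.symbol_set: List[str] = []
        self.response_type = 'symbol'

    def configure(self, params: Dict[str, Any]) -> None:
        self.symbol_set = list(params.get('symbol_set', 'ABC'))

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        probability = 1.0 / len(self.symbol_set)
        return [(symbol, probability) for symbol in self.symbol_set]

    def set_response_type(self, response_type: Literal['symbol', 'word']) -> None:
        self.response_type = response_type


def best_symbol(model: LanguageModel) -> str:
    """Typed against the Protocol: any structurally-matching object works."""
    predictions = model.predict([])
    return max(predictions, key=lambda pair: pair[1])[0]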
…col. Adjusted subclasses and references
I've addressed 1. and 5. with the previous two commits. Regarding 2., I think that it is theoretically possible, but I worry that it would get far too messy with each model type requiring different parameters to initialize. I think having them all separate is likely cleaner. As it stands, the majority of the adapters inherit the same predict method. I considered moving some of the symbol set modifications from the model init methods into the super class init method, but now that it is a protocol, it might not be necessary/wanted for models in BciPy that aren't actually adapters.
I agree that 4. should be a separate ticket. I took a quick stab at doing this and it seems that there's an extra layer of complication because BciPyLanguageModel is a Protocol as well.
bcipy/language/main.py (Outdated)

@@ -18,26 +18,28 @@ def __str__(self):
         return self.value


-class LanguageModel(ABC):
-    """Parent class for Language Models."""
+class BciPyLanguageModel(Protocol):
This should just be LanguageModel. It's already name-spaced in the bcipy package. As far as I can tell we never import the textpredict LanguageModel base class, but if we did we could use an import alias (https://docs.python.org/3/reference/simple_stmts.html#import) to disambiguate.
-class LanguageModel(ABC):
-    """Parent class for Language Models."""
+class LanguageModel(Protocol):
+    """Protocol for BciPy Language Models."""
The LanguageModel Protocol is used for defining an interface (or contract) that any class used by our code must implement. It is intended for typing code that uses language models; it is not intended for code re-use. Implementing classes don't need to subclass the Protocol.
This interface should be minimal, with only the methods that are used by the calling code. Also, protocol methods shouldn't have a body (just use ...). See my earlier comment regarding what the implementation could look like.
If you need code reuse for the adapters you could have a common LanguageModelAdapter parent class or use a mixin approach.
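A rough sketch of that shared-parent idea (the symbol constants and the wrapped toolkit's predict call are assumptions for illustration, not this PR's actual code):

from typing import List, Tuple

SPACE_CHAR = '_'      # assumed BciPy space symbol
BACKSPACE_CHAR = '<'  # assumed BciPy backspace symbol


class LanguageModelAdapter:
    """Shared plumbing for wrapping a textpredict model: translates BciPy's
    special symbols into the characters the toolkit expects and maps the
    returned distribution back, so subclasses only wire up their model."""

    def __init__(self, model) -> None:
        self.model = model  # the wrapped textpredict model (hypothetical API)

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        # BciPy's space symbol becomes a plain space for the toolkit.
        context = [' ' if symbol == SPACE_CHAR else symbol for symbol in evidence]
        predictions = self.model.predict(context)  # assumed toolkit signature
        # Map the toolkit's space back to BciPy's symbol in the output.
        return [(SPACE_CHAR if symbol == ' ' else symbol, prob)
                for symbol, prob in predictions]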
With Protocols we need to change the language_model registry process. It currently depends on LanguageModel.__subclasses__, which is a fairly brittle pattern and won't work with structural subtyping. The simplest change for this PR would be to hard-code the supported models in the language_models_by_name function. Then, in the init_language_model function, models should be instantiated using an empty constructor and configured using the protocol methods (a sketch follows). I'm happy to work on a followup PR for registration.
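A sketch of what that could look like (language_models_by_name and init_language_model are named above; the model classes, import paths, and parameter keys are assumptions):

import json
from typing import Dict, Type


def language_models_by_name() -> Dict[str, Type]:
    """Hard-coded registry of supported models, keyed by display name."""
    # Hypothetical import paths; imported lazily to keep module import cheap.
    from bcipy.language.uniform import UniformLanguageModel
    from bcipy.language.adapters import NGramAdapter
    return {
        'UNIFORM': UniformLanguageModel,
        'NGRAM': NGramAdapter,
    }


def init_language_model(parameters: dict):
    """Instantiate with an empty constructor, then configure via the protocol."""
    model_class = language_models_by_name()[parameters['lang_model_type']]
    model = model_class()
    # lm_params stored as a serialized JSON string (see point 3 above).
    model.configure(json.loads(parameters.get('lm_params', '{}')))
    model.set_response_type(parameters.get('response_type', 'symbol'))
    return model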
If you want to work through any of this over a call let me know.
Merging toolkit refactor into Banff LM branch for sim testing.
Overview
Replaced all custom models in the language module with language model adapters. Adapters rely on aactextpredict, our new LM toolkit, for the heavy lifting and only need to handle BciPy-specific concerns such as the special space and backspace characters and response-type properties.
Ticket
Link a pivotal ticket here
Contributions
Test
Documentation
Changelog