
LM Toolkit Refactor #381


Open · dcgaines wants to merge 55 commits into base: 2.0.0

Conversation

@dcgaines (Collaborator) commented Mar 7, 2025

Merging toolkit refactor into Banff LM branch for sim testing.

Overview

Replaced all custom models in the language module with language model adapters. Adapters rely on aactextpredict, our new LM toolkit, for the heavy lifting and only need to handle BciPy-specific things like special space and backspace characters and response type properties.
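
A minimal sketch of the adapter pattern described above (the class name, symbol placeholders, and the toolkit call signature are assumptions for illustration, not the actual aactextpredict API):

from typing import List, Tuple

BACKSPACE_CHAR = '<'   # assumed placeholder for BciPy's backspace symbol
SPACE_CHAR = '_'       # assumed placeholder for BciPy's space symbol


class ExampleLanguageModelAdapter:
    """Hypothetical adapter: the toolkit model does the heavy lifting,
    while the adapter handles BciPy-specific symbols."""

    def __init__(self, toolkit_model, symbol_set: List[str]):
        self.model = toolkit_model
        self.symbol_set = symbol_set

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        # Translate BciPy's special space character before querying the toolkit.
        context = [' ' if sym == SPACE_CHAR else sym for sym in evidence]
        probs = dict(self.model.predict(context))  # toolkit call; signature assumed
        # Give backspace a small fixed probability and renormalize.
        probs[BACKSPACE_CHAR] = 0.05
        total = sum(probs.values())
        return [(sym, p / total) for sym, p in probs.items()]

In the refactor itself, most adapters share the predict implementation in the super class, with only the Oracle model overriding it.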

Ticket

Link a pivotal ticket here

Contributions

  • Deprecated LanguageModel classes in favor of LanguageModelAdapter classes.
  • Consolidated predict methods into the super class; subclasses override only when needed (Oracle).
  • Renamed the KenLM model to NGram to match the aactextpredict package.
  • Updated all references to the KenLM and LanguageModel classes to match the new names/classes.

Test

  • Ran all pytest cases

Documentation

  • Language module README updated. Added links to textpredict repo and AAC adapting arXiv paper.

Changelog

  • Is the CHANGELOG.md updated with your detailed changes?
  • Not yet.

tab-cmd and others added 30 commits January 8, 2025 12:43
@tab-cmd (Contributor) left a comment

I'll wait for @lawhead to review since he has more experience here. I would keep the base class as LanguageModel or BciPyLanguageModel, but the others could be annotated as Adapters extending from that. We may want our own Uniform here without an adapter. I understand why you need it in the toolkit, but it's simple enough to keep here, and it could be a good example of how to build an LM in BciPy.
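
As a rough sketch of what such a toolkit-free Uniform LM could look like in BciPy (names and structure are illustrative, not the existing implementation):

from typing import List, Tuple


class UniformLanguageModel:
    """Minimal uniform LM: every symbol in the symbol set gets equal
    probability regardless of context."""

    def __init__(self, symbol_set: List[str]):
        self.symbol_set = symbol_set

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        prob = 1.0 / len(self.symbol_set)
        return [(symbol, prob) for symbol in self.symbol_set]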

The toolkit doesn't seem to work for Python 3.10.6. >=3.7,<3.11?

Also, some linting errors!

@tab-cmd deleted the branch 2.0.0 on March 20, 2025 08:44
@tab-cmd closed this on Mar 20, 2025
@tab-cmd reopened this on Mar 20, 2025
@tab-cmd changed the base branch from BANFF_lm to 2.0.0 on March 20, 2025 08:53

codacy-production bot commented Mar 24, 2025

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: Report missing for aa216ca (see footnote 1)
Diff coverage: 75.53% (target: 10.00%)

Coverage variation details

  Commit                             Coverable lines   Covered lines    Coverage
  Common ancestor commit (aa216ca)   Report Missing    Report Missing   Report Missing
  Head commit (de49406)              8831              5952             67.40%

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

  Commit                Coverable lines   Covered lines   Diff coverage
  Pull request (#381)   94                71              75.53%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%
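
For this pull request, that is 71 covered lines out of 94 coverable lines: 71 / 94 × 100% ≈ 75.53%.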


Footnotes

  1. Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.

@lawhead (Collaborator) left a comment

Thanks for all of your effort on this PR. I like that it moves the details important for developing and evaluating language models into a separate space that can evolve independently from BciPy. However, there are a few things I would like to see implemented differently. Some of this feedback is detailed, so let me know if you want to set up a meeting to discuss.

  1. Language models have always been an important component of BciPy and we need to retain that priority while more formally specifying an API. While the textpredict library provides most of the language models used in BciPy, it does not provide all of them (ex. Oracle LM). We also need to leave open the potential for users to bring their own models. So we still need to retain a LanguageModel class to specify what is required of a language model for use in BciPy.

Now that we are using Python 3.8+, we have a few more options than when this code was originally written. Rather than tightly coupling this with textpredict and using the LanguageModel ABC class in that library, I propose creating a LanguageModel Protocol (https://peps.python.org/pep-0544) in BciPy. The textpredict library can still maintain its own LanguageModel base class, and all models would implicitly implement this protocol.

from typing import Any, Dict, List, Literal, Protocol, Tuple

class LanguageModel(Protocol):

  def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
    ...
    
  def configure(self, params: Dict[str, Any]) -> None:
    """Configure the language model. Assumes a no-arg constructor.
       See below regarding parameters."""

  # Tasks don't use word prediction yet, so maybe it's still optional and not included in the protocol.
  def set_response_type(self, response_type: Literal['symbol', 'word']):
    ...
  2. Many of the adapters have similar code for handling spaces and backspaces. Is it possible to have a single adapter for all textpredict models?

  3. Regarding language model parameters, I would prefer to establish a different mechanism for passing parameters to language models than using lm_params, which seems specific to what's currently in textpredict and may easily get out of sync. Maybe we have another value in parameters.json for this that is a serialized json string (a sketch of how this could be consumed follows this list).

  "lm_params": {
    "value": "{}",
    "section": "lang_model_config",
    "name": "Language Model Parameters",
    "helpTip": "Parameters passed to the selected language model.",
    "recommended": [
    ],
    "editable": false,
    "type": "str"
  }
  4. The language_model helper currently depends on importing LanguageModel subclasses to know what's available. This mechanism should be re-worked to be a registry allowing other models to be included. This is a lower priority and can be pushed to a subsequent ticket.

  5. I agree with Tab that BciPy should have its own Uniform LM.
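
Regarding point 3, a minimal sketch of how a serialized lm_params value could be consumed (helper and parameter names here are illustrative, not the current BciPy API):

import json
from typing import Any, Dict


def parse_lm_params(serialized: str) -> Dict[str, Any]:
    """Parse the serialized lm_params string from parameters.json."""
    return json.loads(serialized) if serialized else {}


# Usage sketch: no-arg constructor, then configuration via the protocol method.
# model = SomeLanguageModel()
# model.configure(parse_lm_params(params_value))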

@dcgaines (Collaborator, Author) commented:

I've addressed 1. and 5. with the previous two commits.

Regarding 2., I think that it is theoretically possible, but I worry that it would get far too messy with each model type requiring different parameters to initialize. I think having them all separate is likely cleaner. As it stands, the majority of the adapters inherit the same predict method. I considered moving some of the symbol set modifications from the model init methods into the super class init method, but now that it is a protocol, it might not be necessary/wanted for models in BciPy that aren't actually adapters.

Regarding 3., I think that this would get very messy very quickly. Some of the models require several parameters, which would make for a very long serialized JSON that wouldn't be very readable. Also, changing to a single parameter like this would remove the current ability to have default values for each type of model.

I agree that 4. should be a separate ticket. I took a quick stab at doing this and it seems that there's an extra layer of complication because BciPyLanguageModel is a Protocol as well.

@dcgaines requested review from tab-cmd and lawhead on March 28, 2025 17:48
@@ -18,26 +18,28 @@ def __str__(self):
         return self.value


-class LanguageModel(ABC):
-    """Parent class for Language Models."""
+class BciPyLanguageModel(Protocol):
@lawhead (Collaborator) commented:
This should just be LanguageModel. It's already name-spaced in the bcipy package. As far as I can tell we never import the textpredict LanguageModel base class, but if we did we can use an import alias (https://docs.python.org/3/reference/simple_stmts.html#import) to disambiguate.
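
For example (module paths here are illustrative; the toolkit's actual import path may differ):

# Import alias to disambiguate the two classes; paths are assumed, not verified.
from bcipy.language.main import LanguageModel
from textpredict import LanguageModel as ToolkitLanguageModel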

-class LanguageModel(ABC):
-    """Parent class for Language Models."""
+class LanguageModel(Protocol):
+    """Protocol for BciPy Language Models."""
@lawhead (Collaborator) commented:

The LanguageModel Protocol is used for defining an interface (or contract) that any classes being used by our code must implement. It is intended for typing code that uses language models. It is not intended for code re-use. Implementing classes don't need to subclass the Protocol.

This interface should be minimal and include only the methods that are used by the calling code. Also, protocol methods shouldn't have a body (just use ...). See my earlier comment regarding what the implementation could look like.

If you need code reuse for the adapters you could have a common LanguageModelAdapter parent class or use a mixin approach.

With Protocols we need to change the language_model registry process. It currently depends on LanguageModel.__subclasses__, which is a fairly brittle pattern and won't work with structural subtyping. The simplest change for this PR would be to hard-code the supported models in the language_models_by_name function. Then, in the init_language_model function, models should be instantiated using an empty constructor and configured using the protocol methods. I'm happy to work on a follow-up PR for registration.

If you want to work through any of this over a call let me know.
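
A rough sketch of that interim approach (the function names come from the comment above; the protocol and the stand-in model are illustrative):

from typing import Any, Dict, List, Protocol, Tuple, Type


class LanguageModel(Protocol):
    """The protocol proposed earlier in this review."""

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        ...

    def configure(self, params: Dict[str, Any]) -> None:
        ...


class UniformLanguageModel:
    """Stand-in model; satisfies the protocol structurally without subclassing it."""

    def __init__(self) -> None:
        self.symbol_set: List[str] = []

    def configure(self, params: Dict[str, Any]) -> None:
        self.symbol_set = params.get('symbol_set', ['A', 'B'])

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        prob = 1.0 / len(self.symbol_set)
        return [(symbol, prob) for symbol in self.symbol_set]


def language_models_by_name() -> Dict[str, Type]:
    # Hard-coded registry of supported models; a registration mechanism can follow in a later PR.
    return {'UNIFORM': UniformLanguageModel}


def init_language_model(name: str, lm_params: Dict[str, Any]) -> LanguageModel:
    # Instantiate with the no-arg constructor, then configure via the protocol methods.
    model = language_models_by_name()[name]()
    model.configure(lm_params)
    return model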
